scylladb

Author	SHA1	Message	Date
copilot-swe-agent[bot]	d01358cecd	Add tests and documentation for GROUP0 BATCH Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>	2025-12-05 18:50:24 +00:00
copilot-swe-agent[bot]	35830b34df	Add GROUP0 BATCH statement implementation Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>	2025-12-05 18:47:38 +00:00
copilot-swe-agent[bot]	5bc015549f	Initial plan	2025-12-05 18:34:00 +00:00
Botond Dénes	866c96f536	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculating and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered. Several test cases were introduced to verify the expected behaviour. Backport is not required, as it is a new feature. Fixes #20100 Closes scylladb/scylladb#27287 * github.com:scylladb/scylladb: sstable_test: add verification testcases of SSTable components digests persistance sstables: store digest of all sstable components in scylla metadata sstables: Add TemporaryScylla metadata component type sstables: Extract file writer closing logic into separate methods sstables: Add components_digests to scylla metadata components sstables: Implement CRC32 digest-only writer	2025-12-05 11:36:50 +02:00
Botond Dénes	367633270a	Merge 'EAR: handle IPV6 hosts in KMIP and use shared (improved) http parser in AWS/Azure' from Calle Wilund Fixes #27367 Fixes #27362 Fixes #27366 Makes http URL parser handle IPv6. Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere. Closes scylladb/scylladb#27368 * github.com:scylladb/scylladb: ear::kms/ear::azure: Use utils::http URL parsing ear::kmip_host: Handle ipv6 hosts + use system trust when not specified utils::http: Handle ipv6 numeric host part in URL:s	2025-12-05 10:43:07 +02:00
Asias He	e97a504775	repair: Allow min max range to be updated for repair history It is observed that: repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb: seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum token,maximum token) is not in the format of (start, end]) This is because repair checks that the end of the range to be repaired is inclusive. When small_table_optimization is enabled for regular repair, a (minimum token,maximum token) range is used. To fix, we can relax the (start, end] check for the min max range. Fixes #27220 Closes scylladb/scylladb#27357	2025-12-05 10:41:25 +02:00
Anna Stuchlik	a5c971d21c	doc: update the upgrade policy to cover non-consecutive minor upgrades Fixes https://github.com/scylladb/scylladb/issues/27308 Closes scylladb/scylladb#27319	2025-12-05 10:31:53 +02:00
Guy Shtub	a0809f0032	Update integration-jaeger.rst Fixing broken link in Jaeger Docs to ScyllaDB Closes scylladb/scylladb#26406	2025-12-05 10:23:07 +02:00
Piotr Dulikowski	bb6e41f97a	index: allow vector indexes without rf_rack_valid_keyspces The rf_rack_valid_keyspaces option needs to be turned on in order to allow creating materialized views in tablet keyspaces with numeric RF per DC. This is also necessary for secondary indexes because they use materialized views underneath. However, this option is _not_ necessary for vector store indexes because those use the external vector store service for querying the list of keys to fetch from the main table, they do not create a materialized view. The rf_rack_valid_keyspaces was, by accident, required for vector indexes, too. Remove the restriction for vector store indexes as it is completely unnecessary. Fixes: SCYLLADB-81 Closes scylladb/scylladb#27447	2025-12-05 09:26:26 +02:00
Marcin Maliszkiewicz	4df6b51ac2	auth: fix cache::prune_all roles iteration During `b9199e8b24` review it was suggested to use a standard for loop, but when erasing an element it increments an invalid iterator, as the role could have been erased before. This change brings back the original code. Fixes: https://github.com/scylladb/scylladb/issues/27422 Backport: no, offending commit not released yet Closes scylladb/scylladb#27444	2025-12-04 23:35:54 +01:00
Taras Veretilnyk	0c8730ba05	sstable_test: add verification testcases of SSTable components digests persistance Adds a generic test helper that writes a random SSTable, reloads it, and verifies that the persisted CRC32 digest for each component matches the digest computed from disk. These test cases cover all checksummed components.	2025-12-04 21:09:01 +01:00
Taras Veretilnyk	bc2e83bc1f	sstables: store digest of all sstable components in scylla metadata This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.	2025-12-04 21:00:09 +01:00
Patryk Jędrzejczak	f4c3d5c1b7	Merge 'fix test_coordinator_queue_management flakiness' from Gleb Natapov After `39cec4ae45` node join may fail either with a "request canceled" notification or (very rarely) because the node was banned, depending on timing. The series fixes the test to check for both possibilities. Fixes #27320 No need to backport since the flakiness is in master only. Closes scylladb/scylladb#27408 * https://github.com/scylladb/scylladb: test: fix test_coordinator_queue_management flakiness test/pylib: allow expected_error in server_start to contain regular expression	2025-12-04 16:08:02 +01:00
Tomasz Grabiec	e54abde3e8	Merge 'main: delay setup of storage_service REST API' from Andrzej Jackowski The storage_service REST API uses `group0` internally. Before this patch, it was possible to send an HTTP request before `group0` was initialized, which resulted in a segmentation fault. Therefore, this patch delays the setup of the storage_service REST API. Additionally, `test_rest_api_on_startup` is added to reproduce the problem. Fixes: https://github.com/scylladb/scylladb/issues/27130 No backport. It's a crash fix but possible only if a request is sent in a very specific phase of a node start. Closes scylladb/scylladb#27410 * github.com:scylladb/scylladb: test: add test_rest_api_on_startup main: delay setup of storage_service REST API	2025-12-04 14:56:49 +01:00
Avi Kivity	9696ee64d0	database: fix overflow when computing data distribution over shards We store the per-shard chunk count in a uint64_t vector global_offset, and then convert the counts to offsets with a prefix sum: ```c++ // [1, 2, 3, 0] --> [0, 1, 3, 6] std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus()); ``` However, std::exclusive_scan takes the accumulator type from the initial value, 0, which is an int, instead of from the range being iterated, which is of uint64_t. As a result, the prefix sum is computed as a 32-bit integer value. If it exceeds 0x8000'0000, it becomes negative. It is then extended to 64 bits and stored. The result is a huge 64-bit number. Later on we try to find an sstable with this chunk and fail, crashing on an assertion. An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57 The fix is simple: the initial value is passed as uint64_t instead of int. Fixes https://github.com/scylladb/scylladb/issues/27417 Closes scylladb/scylladb#27418	2025-12-04 14:10:53 +01:00
Calle Wilund	8dd69f02a8	ear::kms/ear::azure: Use utils::http URL parsing Fixes #27367 Move to reuse shared code.	2025-12-04 11:38:41 +00:00
Calle Wilund	d000fa3335	ear::kmip_host: Handle ipv6 hosts + use system trust when not specified Fixes #27362 The KMIP host connector should handle ipv6 connections (named or numeric). It should also fall back to system trust when a truststore is not specified.	2025-12-04 11:38:41 +00:00
Calle Wilund	4e289e8e6a	utils::http: Handle ipv6 numeric host part in URL:s Fixes #27366 A URL with a numeric host part is formatted specially in the case of ipv6, to avoid confusion with the port part. The parser should handle this, e.g. http://[2001:db8:4006:812::200e]:8080 v2: * Include scheme agnostic parse + case insensitive scheme matching	2025-12-04 11:38:41 +00:00
Botond Dénes	9d2f7c3f52	Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were: * Pruning with concurrency 1 took total 1416 seconds * Pruning with concurrency 2 took total 731 seconds * Pruning with concurrency 10 took total 234 seconds * Pruning with concurrency 100 took total 171 seconds So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may no longer be bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE. Fixes https://github.com/scylladb/scylladb/issues/27070 Closes scylladb/scylladb#27097 * github.com:scylladb/scylladb: mv: allow setting concurrency in PRUNE MATERIALIZED VIEW cql: add CONCURRENCY to the USING clause	2025-12-04 11:47:41 +02:00
Aleksandra Martyniuk	e3e81a9a7a	repair: throw if flush failed in get_flush_time Currently, _flush_time is stored as a std::optional<gc_clock::time_point>, and std::nullopt indicates that the flush was needed but failed. This is confusing for the caller and does not work as expected, since _flush_time is initialized with a value (not an empty optional). Change the _flush_time type to gc_clock::time_point. If a flush is needed but failed, get_flush_time() throws an exception. This was supposed to be a part of https://github.com/scylladb/scylladb/pull/26319 but it was mistakenly overwritten during rebases. Refs: https://github.com/scylladb/scylladb/issues/24415. Closes scylladb/scylladb#26794	2025-12-04 11:45:53 +02:00
Avi Kivity	b82f92b439	main: replace p11-kit hack for trust paths override with gnutls hack p11-kit has hardcoded paths for the trust paths. Of course, each Linux distribution hardcodes those paths differently. As a result, our relocatable gnutls, which uses p11-kit-trust.so to process the trust paths, needs some overrides to select the right paths. Currently, we use p11_kit_override_system_files(), a p11-kit API intended for testing, but which worked well enough for our purpose, to override the trust module configuration. Unfortunately, starting (presumably [1]) in gnutls 3.8.11, gnutls changed how it works with p11-kit and our override is now ignored. This was likely unintentional, but there appears to be a better way: instead of letting gnutls auto-load the trust module from a hacked configuration, we load the modules ourselves using gnutls_pkcs11_init(GNUTLS_PKCS11_FLAG_MANUAL) and gnutls_pkcs11_add_provider(). These appear to be intended for the purpose. We communicate the paths to the scylla executable using an environment variable. This isn't optimal, but is much easier than adding a command line variable since there are multiple levels of command line parsing due to the subtool mechanism. With this, we unlock the possibility to upgrade gnutls to newer versions. [1] `aa5f15a872` Closes scylladb/scylladb#27348	2025-12-04 11:33:51 +02:00
Gleb Natapov	f00e00fde0	test: fix test_coordinator_queue_management flakiness After `39cec4ae45` node join may fail either with a "request canceled" notification or (very rarely) because the node was banned, depending on timing. The patch fixes the test to check for both possibilities.	2025-12-04 11:06:20 +02:00
Gleb Natapov	b0727d3f2a	test/pylib: allow expected_error in server_start to contain regular expression Currently expected_error parameter to server_start can only work with exact matches. Change it to support regular expressions.	2025-12-04 11:06:20 +02:00
Calle Wilund	4169bdb7a6	encryption::gcp_host: Add exponential retry for server errors Fixes #27242 Similar to AWS, google services may at times simply return a 503, more or less meaning "busy, please retry". In most cases we rely on higher layers to handle the retry, but we cannot fully do so, both because we sometimes reach this code through paths that do no such thing and because it would be slightly inefficient, since we would like to control, for example, the back-off for auth. This simply changes the existing retry loop in gcp_host to be a little more forgiving, special-case 503 errors, and extend the retry to the auth part, as well as re-use the exponential_backoff_retry primitive. v2: * Avoid backoff if refreshing credentials. Should not add latency due to this. * Only allow re-auth once per (non-service-failure-backoff) try. * Add abort source to both request and retry v3: * Include timeout and other server errors in retry-backoff v4: * Reorder error code handling correctly Closes scylladb/scylladb#27267	2025-12-04 10:13:37 +02:00
Anna Stuchlik	c5580399a8	replace the Driver pages with a link to the new Drivers pages This commit removes the now redundant driver pages from the ScyllaDB documentation. Instead, a link to the pages where we moved the driver information is added. Also, the links are updated across the ScyllaDB manual. Redirections are added for all the removed pages. Fixes https://github.com/scylladb/scylladb/issues/26871 Closes scylladb/scylladb#27277	2025-12-04 10:07:27 +02:00
Tomasz Grabiec	1d42770936	Merge 'topology_coordinator: Add barrier to cleanup_target' from Łukasz Paszkowski Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512 It's a pre existing issue. Backport is required to all recent 2025.x versions. Closes scylladb/scylladb#27413 * github.com:scylladb/scylladb: topology_coordinator: Fix the indentation for the cleanup_target case topology_coordinator: Add barrier to cleanup_target test_node_failure_during_tablet_migration: Increase RF from 2 to 3	2025-12-03 23:57:45 +01:00
Taras Veretilnyk	d287b054b9	sstables: Add TemporaryScylla metadata component type Add TemporaryScylla component type to make atomic updates of SSTable Scylla metadata using temporary files and atomic rename operations possible. This will be needed in further commit to rewrite metadata together with the statistics component.	2025-12-03 23:40:10 +01:00
Szymon Wasik	4f803aad22	Improve documentation of vector search configuration parameters. This patch adds separate group for vector search parameters in the documentation and fixes small typos and formatting. Fixes: SCYLLADB-77. Closes scylladb/scylladb#27385	2025-12-03 21:02:59 +02:00
Karol Nowacki	a54bf50290	vector_search: Fix requests hanging on unreachable nodes When a vector store node becomes unreachable, a client request sent before the keep-alive timer fires would hang until the CQL query timeout was reached. This occurred because the HTTP request writes to the TCP buffer and then waits for a response. While data is in the buffer, TCP retransmissions prevent the keep-alive timer from detecting the dead connection. This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket option, which applies an effective timeout to TCP retransmissions, allowing the connection to fail faster. Closes scylladb/scylladb#27388	2025-12-03 21:01:43 +02:00
Nadav Har'El	06dd3b2e64	install-dependencies.sh: add zlib Scylla uses zlib, through the header <zlib.h>, in sstable compression. We also want to use it in Alternator for gzip-compressed requests. We never actually required zlib explicitly in install-dependencies.sh, we only get it through transitive dependencies. But it's better to require it explicitly so this is what we do in this patch. In Fedora, we use the newer, more efficient, zlib-ng which is API-compatible with the classic zlib. Unfortunately, the Debian zlib-ng package is not drop-in compatible with zlib (you need to include a different header file <zlib-ng.h>) so we use the classic zlib. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27238	2025-12-03 19:30:36 +02:00
Łukasz Paszkowski	6163fedd2e	topology_coordinator: Fix the indentation for the cleanup_target case	2025-12-03 16:37:33 +01:00
Łukasz Paszkowski	67f1c6d36c	topology_coordinator: Add barrier to cleanup_target Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512	2025-12-03 16:19:17 +01:00
Łukasz Paszkowski	669286b1d6	test_node_failure_during_tablet_migration: Increase RF from 2 to 3 The patch prepares the test for additional write workload to be executed in parallel with node failures. With the original RF=2, QUORUM is also 2, which causes writes to fail during node outage. To address it, the third rack with a single node is added and the replication factor is increased to 3.	2025-12-03 16:00:19 +01:00
Botond Dénes	b9199e8b24	Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz Scylla currently has poor resiliency to connection storms. Unbounded concurrency of new connections on the client side can easily overload nodes or hurt their latency. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods. To improve resiliency we are introducing a unified, specialized auth cache to the system. This patch series is stage 1, where the cache is used only on the login path. Dependency diagram: ``` \|Authentication Layer\| \| v +--------------------------------+ \| Auth Cache \| +--------------------------------+ ^ \| \| \| \| v \|Raft Write Logic \| \| CQL Read Layer\| ``` Cache invalidation is based on raft and the cache contains the full content of the related tables. The LDAP role manager may benefit partially, as the can_login function is common and will be cached, but it still needs to query roles from an external source. Performance results: for the single-shard connection/disconnection scenario, insns/conn decreased by 5%, allocs/conn decreased by 23%, tasks/conn decreased by 20%. Results for 20 shards are very similar. 
Raw data before: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1128.55 tps (599.2 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op, 0 errors) 1157.41 tps (601.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op, 0 errors) 1167.42 tps (603.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op, 0 errors) 1159.63 tps (605.9 allocs/op, 0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op, 0 errors) 1165.12 tps (608.8 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op, 0 errors) throughput: mean= 1155.63 standard-deviation=15.66 median= 1159.63 median-absolute-deviation=9.49 maximum=1167.42 minimum=1128.55 instructions_per_op: mean= 2602934.31 standard-deviation=16063.01 median= 2603234.19 median-absolute-deviation=13887.96 maximum=2625804.05 minimum=2586609.82 cpu_cycles_per_op: mean= 1359576.30 standard-deviation=5945.69 median= 1360607.05 median-absolute-deviation=4358.94 maximum=1365736.42 minimum=1350912.10 ``` Raw data after: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1132.09 tps (457.5 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op, 0 errors) 1157.70 tps (458.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 
1283768 cycles/op, 0 errors) 1162.86 tps (459.0 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op, 0 errors) 1153.15 tps (460.2 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op, 0 errors) 1142.09 tps (460.6 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op, 0 errors) 1124.89 tps (462.5 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op, 0 errors) 1156.75 tps (464.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op, 0 errors) 1152.16 tps (466.3 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op, 0 errors) 1154.77 tps (469.8 allocs/op, 0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op, 0 errors) 1152.22 tps (472.4 allocs/op, 0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op, 0 errors) throughput: mean= 1148.87 standard-deviation=12.08 median= 1153.15 median-absolute-deviation=7.88 maximum=1162.86 minimum=1124.89 instructions_per_op: mean= 2487755.88 standard-deviation=43838.23 median= 2478900.02 median-absolute-deviation=24531.06 maximum=2571954.26 minimum=2432485.38 cpu_cycles_per_op: mean= 1304144.76 standard-deviation=22129.55 median= 1305025.71 median-absolute-deviation=12363.25 maximum=1345341.16 minimum=1270655.17 ``` Fixes https://github.com/scylladb/scylladb/issues/18891 Backport: no, it's a new feature Closes scylladb/scylladb#26841 * github.com:scylladb/scylladb: auth: use auth cache on login path auth: corutinize standard_role_manager::can_login main: auth: add auth cache dependency to auth service raft: update auth cache when data changes auth: storage_service: reload auth cache on v1 to v2 auth migration raft: reload auth cache on snapshot application service: add auth cache getter to storage service main: start auth cache service auth: add unified cache implementation auth: move table names to common.hh	2025-12-03 16:45:01 +02:00
Andrzej Jackowski	1ff7f5941b	test: add test_rest_api_on_startup This test verifies that REST API requests are handled properly when a server is started or restarted. It is used to verify the fix for scylladb/scylladb#27130, where a server failed with a segmentation fault when `storage_service/raft_topology/reload` was called too early. Refs: scylladb/scylladb#27130	2025-12-03 15:35:59 +01:00
Andrzej Jackowski	3b70154f0a	main: delay setup of storage_service REST API The storage_service REST API uses `group0` internally. Before this patch, it was possible to send an HTTP request before `group0` was initialized, which resulted in a segmentation fault. Therefore, this patch delays the setup of the storage_service REST API. Fixes: scylladb/scylladb#27130	2025-12-03 15:35:54 +01:00
Pavel Emelyanov	6ae72ed134	test: Reuse S3 fixtures facilities in cqlpy/test_tools.py Creating endpoint conf can be made with the s3_server method Getting boto3 resource from s3_server itself is also possible Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27380	2025-12-03 16:32:54 +02:00
Michael Litvak	9213a163cb	test: fix test flakiness in test_colocated_tables_gc_mode The test executes a LWT query in order to create a paxos state table and verify the table properties. However, after executing the LWT query, the table may not exist on all nodes but only on a quorum of nodes, thus checking the properties of the table may fail if the table doesn't exist on the queried node. To fix that, execute a group0 read barrier to ensure the table is created on all nodes. Fixes scylladb/scylladb#27398 Closes scylladb/scylladb#27401	2025-12-03 12:12:24 +01:00
David Garcia	d9593732b1	docs: add strict mode to control metrics validation behavior The metrics extension now includes validation to detect missing metrics. This validation caused failures during multiversion publication because older versions did not generate all required properties. Instead of fixing each branch, a strict mode flag was introduced to control when validation should run. Strict mode is enabled in the workflow that validates pull requests, ensuring that new changes meet the expected metrics. During multiversion builds, validation errors are now logged but do not raise exceptions, which prevents build failures while still providing visibility into missing data. docs: verbose mode docs: verbose mode Closes scylladb/scylladb#27402	2025-12-03 14:09:08 +03:00
Anna Stuchlik	48cf84064c	doc: add the upgrade guide from 2025.x to 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26451 Fixes https://github.com/scylladb/scylladb/issues/26452 Closes scylladb/scylladb#27310	2025-12-03 11:18:10 +03:00
Avi Kivity	a12165761e	Update seastar submodule * seastar b5c76d6b...7ec14e83 (5): > Merge 'reactor: coroutinize more file related functions' from Avi Kivity reactor: reindent after coroutinization reactor: fdatasync: coroutinize reactor: touch_directory: coroutinize reactor: make_directory: coroutinize reactor: open_directory: coroutinize reactor: statvfs: coroutinize reactor: fstatfs: coroutinize reactor: file_system_at: coroutinize reactor: file_accessible: coroutinize reactor: file_size: coroutinize reactor: file_stat: coroutinize > reactor: Mark some sched-stats getters const > Merge 'coroutine: allocate coroutine frame in a critical section' from Avi Kivity coroutine: allocate coroutine frame in a critical section memory: add C23 free_sized, free_aligned_sized > coroutines: simplify execute_involving_handle_destruction_in_await_suspend() > coroutine: introduce try_future Closes scylladb/scylladb#27369	2025-12-03 10:55:47 +03:00
Nadav Har'El	7dc04b033c	test/cluster: fix missing racks in xfailing Alternator test Since Alternator is now using tablets by default, it's no longer possible to create an Alternator table on a 3-node cluster with a single rack - you need to have 3 racks to support RF=3. Most of the multi-node Alternator tests in test/cluster/test_alternator.py were already fixed to use a 3-rack cluster, but one test was missed because it was marked "xfail" so its new failure to create the table was missed. This patch adds the missing 3-rack setup, so the xfailing test returns to failing on the real bug - not on the table creation. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27382	2025-12-03 10:54:11 +03:00
Piotr Dulikowski	654ac9099b	db/view/view_building_coordinator: skip work if no view is built Even though `view_building_coordinator::work_on_view_building` has an `if` at the very beginning which checks whether the currently processed base table is set, it only prints a message and continues executing the rest of the function regardless of the result of the check. However, some of the logic in the function assumes that the currently processed base table field is set and tries to access the value of the field. This can lead to the view building coordinator accessing a disengaged optional, which is undefined behavior. Fix the function by adding the clearly missing `co_return` to the check. A regression test is added which checks that the view building state observer - a different fiber which used to print a weird message due to erroneous view building coordinator behavior - does not print a warning. Fixes: scylladb/scylladb#27363 Closes scylladb/scylladb#27373	2025-12-03 09:44:28 +02:00
Andrzej Jackowski	ff1b212319	Update tools/cqlsh submodule The motivation for the update is using newer version of scylla-driver that supports new event type CLIENT_ROUTES_CHANGE. * tools/cqlsh 22401228...6badc992 (2): > Update scylla-driver version to 3.29.6 > Revert "Migrate workflows to Blacksmith" Closes scylladb/scylladb#27359	2025-12-02 15:14:26 +02:00
Calle Wilund	4e7ec9333f	gcp::object_storage: Include auth in exponential back-off-retry Fixes #27268 Refs #27268 Includes the auth call in code covered by backoff-retry on server error, as well as moves the code to use the shared primitive for this and increase the resilience a bit (increase retry count). v2: * Don't do backoff if we need to refresh credentials. * Use abort source for backoff if avail v3: * Include other retryable conditions in auth check Closes scylladb/scylladb#27269	2025-12-02 15:08:49 +02:00
Botond Dénes	357f91de52	Revert "Merge 'db/config: enable `ms` sstable format by default' from Michał Chojnowski" This reverts commit `b0643f8959`, reversing changes made to `e8b0f8faa9`. The change forgot to update sstables_manager::get_highest_supported_format(), which results in /system/highest_supported_sstable_version still returning me, confusing and breaking tests. Fixes: scylladb/scylla-dtest#6435 Closes scylladb/scylladb#27379	2025-12-02 14:38:56 +02:00
Taras Veretilnyk	a191503ddf	sstables: Extract file writer closing logic into separate methods Refactor the consume_end_of_stream() method by extracting the inline file writer closing logic into dedicated methods: - close_index_writer() - close_partitions_writer() - close_rows_writer()	2025-12-02 13:07:41 +01:00
Taras Veretilnyk	619bf3ac4b	sstables: Add components_digests to scylla metadata components Add components_digests struct with optional digest fields for storing CRC32 digests of individual SSTable components in Scylla metadata. These include: - Data - Compression - Filter - Statistics - Summary - Index - TOC - Partitions - Rows	2025-12-02 12:36:34 +01:00
Pawel Pery	b5c85d08bb	unittest: fix vector_store_client_test_dns_refresh_aborted hangs The root cause for the hanging test is a concurrency deadlock. `vector_store_client` runs dns refresh time and it is waiting for the condition variable. After aborting the dns request, the test signals the condition variable. Stopping the vector_store_client takes long enough to trigger the next dns refresh, and this time the condition variable won't be signalled, so vector_store_client will wait forever for the dns refresh fiber to finish. The commit fixes the problem by waiting for the condition variable only once. Fixes: #27237 Fixes: VECTOR-370 Closes scylladb/scylladb#27239	2025-12-02 12:22:44 +01:00
Piotr Dulikowski	3aaab5d5a3	Merge 'vector_search: Fix high availability during timeouts' from Karol Nowacki This PR introduces two key improvements to the robustness and resource management of vector search: Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption. Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly. Fixes: SCYLLADB-76 This issue affects the 2025.4 branch, where similar HA recovery delays have been observed. Closes scylladb/scylladb#27377 * github.com:scylladb/scylladb: vector_search: Fix ANN query abort on CQL timeout vector_search: Reduce connection and keep-alive timeouts	2025-12-02 11:14:48 +01:00
Ernest Zaslavsky	605f71d074	s3_client: handle additional transient network errors Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable when the client re-creates the socket for each request. Fixes: https://github.com/scylladb/scylladb/issues/27349 Closes scylladb/scylladb#27350	2025-12-02 11:44:40 +02:00
Karol Nowacki	086c6992f5	vector_search: Fix ANN query abort on CQL timeout When a CQL vector search request timed out, the underlying ANN query was not aborted and continued to run. This happened because the abort source was not being signaled upon request expiration. This commit ensures the ANN query is aborted when the CQL request times out preventing unnecessary resource consumption.	2025-12-02 01:17:01 +01:00
Karol Nowacki	b6afacfc1e	vector_search: Reduce connection and keep-alive timeouts The connection timeout was 2 minutes and the keep-alive timeout was 11 minutes. If a vector store node became unreachable, these long timeouts caused significant delays before the system could recover, negatively impacting high availability. This change aligns both timeouts with the `request_timeout` configuration, which defaults to 10 seconds. This allows for much faster failure detection and recovery, ensuring that unresponsive nodes are failed over from more quickly.	2025-12-02 01:17:01 +01:00
Łukasz Paszkowski	0ed3452721	service/storage_service: Mark nodes excluded on shard0 Excluding nodes is a group0 operation and as such it needs to be executed only on shard0. If the method `mark_excluded` is invoked on a different shard, redirect the request to shard0. Fixes https://github.com/scylladb/scylladb/issues/27129 Closes scylladb/scylladb#27167	2025-12-01 17:30:40 +01:00
Jenkins Promoter	c3c0991428	Update pgo profiles - aarch64	2025-12-01 13:47:56 +02:00
Jenkins Promoter	563e5ddd62	Update pgo profiles - x86_64	2025-12-01 04:24:36 +02:00
Artsiom Mishuta	796205678f	test.py: set worksteal distribution Set worksteal distribution for xdist (new scheduler), because it now shows better test distribution than the standard one (load) in CI Closes scylladb/scylladb#27354	2025-11-30 18:13:03 +02:00
Emil Maskovsky	902d70d6b2	.github: add Copilot instructions for AI-generated code Add comprehensive coding guidelines for GitHub Copilot to improve quality and consistency of AI-generated code. Instructions cover C++ and Python development with language-specific best practices, build system usage, and testing workflows. Following GitHub Copilot's standard layout with general instructions in .github/copilot-instructions.md and language-specific files in .github/instructions/ directory using *.instructions.md naming. No backport: This change is only for developers in master, so it doesn't need to be backported. Closes scylladb/scylladb#25374	2025-11-30 13:30:05 +02:00
Avi Kivity	ce2a403f18	Merge 'alternator: implement gzip-compressed requests' from Nadav Har'El In this series we implement Alternator's support for gzip-compressed requests, i.e., requests with the "Content-Encoding: gzip" header, other uncompressed header, and a gzip-compressed body. The server needs to verify the signature of the compressed content, and then uncompress the body before running the request. We only support gzip compression because this is what DynamoDB supports. But in the future we can easily add support for other compression algorithms like lz4 or zstd. This series Refs #5041 but doesn't "Fixes" it because it only implements compressed requests (Content-Encoding), not compressed responses (Accept-Encoding). In addition to the code changes, the series also contains tests for this feature that make sure it behaves like DynamoDB. Note that while we will have now support in our server for compressed requests, just like DynamoDB does, the clients (AWS SDKs) will probably NOT make use of it because they do not enable request compression by default. For example, see the tests for some hoops one needs to jump through in boto3 (the Python SDK) to send compressed requests. However, we are hoping that in the future Alternator's modified clients will use compressed requests and enjoy this feature. Closes scylladb/scylladb#27080 * github.com:scylladb/scylladb: test/alternator: enable, and add, tests for gzip'ed requests alternator: implement gzip-compressed requests	2025-11-30 13:27:46 +02:00
Avi Kivity	d4be9a058c	Update seastar submodule seastar::compat::source_location (which should not have been used outside Seastar) is replaced with std::source_location to avoid deprecation warnings. The relevant header, which was removed, is no longer included. * seastar 8c3fba7a...b5c76d6b (3): > testing: There can be only one memory_data_sink > util: Use std::source_location directly > Merge 'net: support proxy protocol v2' from Avi Kivity apps: httpd: add --load-balancing-algorithm apps: httpd: add /shard endpoint test: socket_test: add proxy protocol v2 test suite test: socket_test: test load balancer with proxy protocol net: posix_connected_socket: specialize for proxied connections net: posix_server_socket_impl: implement proxy protocol in server sockets net: posix_server_socket_impl: adjust indentation net: posix_server_socket_impl: avoid immediately-invoked lambda net: conntrack: complete handle nested class special member functions net: posix_server_socket_impl: coroutinize accept() Closes scylladb/scylladb#27316	2025-11-30 12:38:47 +02:00
Piotr Dulikowski	44c605e59c	Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are: - introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams, - reduce splitting of simple Alternator mutations, - correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs. To summarize, the emitted events of the data manipulation operations should be as follows: - DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK) - DeleteItem of nonexistent item: nothing (OK) - BatchWriteItem.DeleteItem of nonexistent item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK) No backport is necessary. 
Refs https://github.com/scylladb/scylladb/pull/26149 Refs https://github.com/scylladb/scylladb/pull/26396 Refs https://github.com/scylladb/scylladb/issues/26382 Fixes https://github.com/scylladb/scylladb/issues/6918 Closes scylladb/scylladb#26121 * github.com:scylladb/scylladb: test/alternator: Enable the tests failing because of #6918 alternator, cdc: Don't emit events for no-op removes alternator, cdc: Don't emit an event for equal items alternator/streams, cdc: Differentiate item replace and item update in CDC alternator: Change the return type of rmw_operation_return config: Add alternator_streams_strict_compatibility flag cdc: Don't split a row marker away from row cells	2025-11-30 07:20:22 +01:00
Asias He	da5cc13e97	repair: Fix deadlock when topology coordinator steps down in the middle Consider this: 1) n1 is the topology coordinator 2) n1 schedules and executes a tablet repair with session id s1 for a tablet on n3 and n4. 3) n3 and n4 take and store the lock in _rs._repair_compaction_locks[s1] 4) n1 steps down before it executes locator::tablet_transition_stage::end_repair 5) n2 becomes the new topology coordinator 6) n2 runs locator::tablet_transition_stage::repair again 7) n3 and n4 try to take the lock again and hang since the lock is already taken. To avoid the deadlock, we can throw in step 7 so that n2 will proceed to end_repair stage and release the lock. After that, the scheduler could schedule the tablet repair request again. Fixes #26346 Closes scylladb/scylladb#27163	2025-11-28 15:14:39 +01:00
Radosław Cybulski	b54a9f4613	Fix use-after-free in encode_paging_state in Alternator Fix unlikely use-after-free in `encode_paging_state`. The function incorrectly assumes that the current position to encode will always have data for all clustering columns the schema defines. It's possible to encounter a current position having fewer than all columns specified, for example in the case of a range tombstone. Those don't happen in Alternator tables, as DynamoDB doesn't allow range deletions and the clustering key might be of size at most 1. The Alternator API can be used to read Scylla system tables, and those do have range tombstones with more than a single clustering column. The fix is to stop trying to encode columns that don't have a value - they are not needed anyway, as there's no possible position with those values (the range tombstone made sure of that). Fixes #27001 Fixes #27125 Closes scylladb/scylladb#26960	2025-11-28 16:51:15 +03:00
Pavel Emelyanov	d35ce81ff1	Merge 'test: wait for read_barrier in wait_until_driver_service_level_created' from Andrzej Jackowski Previously, `wait_until_driver_service_level_created` only waited for the `driver` service level to appear in the output of `LIST ALL SERVICE_LEVELS`. However, the fact that one node lists `sl:driver` does not necessarily mean that all other nodes can see it yet. This caused sporadic test failures, especially in DEBUG builds. To prevent these failures, this change adds an extra wait for a `raft/read_barrier` after the `driver` service level first appears. This ensures the service level is globally visible across the cluster. Fixes: https://github.com/scylladb/scylladb/issues/27019 No backport - test fix for `sl:driver` tests, and this is only available on `master` Closes scylladb/scylladb#27076 * github.com:scylladb/scylladb: test: wait for read_barrier in wait_until_driver_service_level_created test: use ManagerClient in wait_until_driver_service_level_created	2025-11-28 16:47:29 +03:00
Dawid Mędrek	b76af2d07f	cql3: Improve errors when manipulating default service level Before this commit, any attempt to create, alter, attach, or drop the default service level would result in a syntax error whose error message was unclear: ``` cqlsh> attach service level default to cassandra; SyntaxException: line 1:21 no viable alternative at input 'default' ``` The error stems from the grammar not being able to parse `default` as a correct service level name. To fix that, we cover it manually. This way, the grammar accepts it and we can process it in Scylla. The reason why we'd like to cover the default service level is that it's an actual service level that the user should reference. Getting a syntax error is not what should happen. Hence this fix. We validate the input and if the given role is really the default service level, we reject the query and provide an informative error message. Two validation tests are provided. Fixes scylladb/scylladb#26699 Closes scylladb/scylladb#27162	2025-11-28 15:32:37 +03:00
Dawid Mędrek	48a28c24c5	db/commitlog: Include position and alignment information in errors When we come across a segment truncation, this information may be helpful to determine when the error occurred exactly and hint at what code path might've led to it. Closes scylladb/scylladb#27207	2025-11-28 15:28:08 +03:00
Calle Wilund	59c87025d1	commitlog::read_log_file: Check for eof position on all data reads Fixes #24346 When reading, we check for each entry and each chunk, if advancing there will hit EOF of the segment. However, IFF the last chunk being read has the last entry _exactly_ matching the chunk size, and the chunk ending at _exactly_ segment size (preset size, typically 32Mb), we did not check the position, and instead complained about not being able to read. This has literally _never_ happened in actual commitlog (that was replayed at least), but has apparently happened more and more in hints replay. Fix is simple, just check the file position against size when advancing said position, i.e. when reading (skipping already does). v2: * Added unit test Closes scylladb/scylladb#27236	2025-11-28 15:26:46 +03:00
Ernest Zaslavsky	1d5f60baac	streaming:: add more logging Start logging all missed streaming options like `scope`, `primary_replica` and `skip_reshape` flags Fixes: https://github.com/scylladb/scylladb/issues/27299 Closes scylladb/scylladb#27311	2025-11-28 12:50:33 +01:00
Emil Maskovsky	37e3dacf33	topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted In several exception handlers, only raft::request_aborted was being caught and rethrown, while seastar::abort_requested_exception was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal. For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation" This change adds seastar::abort_requested_exception handling alongside raft::request_aborted in all places where it was missing. When rethrown, these exceptions propagate up to the main run() loop where handle_topology_coordinator_error() recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations. Fixes: scylladb/scylladb#27255 No backport: The problem was only seen in tests and not reported in customer tickets, so it's enough to fix it in the main branch. Closes scylladb/scylladb#27314	2025-11-28 12:19:21 +01:00
Michael Litvak	97b7c03709	tablet: scheduler: Do not emit conflicting migration in merge colocation The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well. Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates. This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations. Fixes scylladb/scylladb#27304 Closes scylladb/scylladb#27312	2025-11-28 11:17:12 +01:00
Taras Veretilnyk	62802b119b	sstables: Implement CRC32 digest-only writer Introduce template parameter to checksummed file writer to support digest-only calculation without storing chunk checksums. This will be needed in the future to calculate digests of other components.	2025-11-27 22:40:07 +01:00
Pavel Emelyanov	54edb44b20	code: Stop using seastar::compat::source_location And switch to std::source_location. Upcoming seastar update will deprecate its compatibility layer. The patch is for f in $(git grep -l 'seastar::compat::source_location'); do sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f; done and removal of few header includes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27309	2025-11-27 19:10:11 +02:00
Avi Kivity	c85671ce51	scripts: refresh-submodules: don't omit last (first) commit `git log --format` doesn't add a newline after the last line. This causes `read` to ignore that line, losing the last line (corresponding to the first commit). Use `git log --tformat` instead, which terminates the last line. Closes scylladb/scylladb#27317	2025-11-27 18:46:27 +02:00
Botond Dénes	9b968dc72c	docs: update dependencies Via make update. Fixes: scylladb/scylladb#27231 Closes scylladb/scylladb#27263	2025-11-27 15:56:34 +03:00
Andrzej Jackowski	e366030a92	treewide: seastar module update The reason for this seastar update is to have the fixed handling of the `integer` type in `seastar-json2code` because it's needed for further development of ScyllaDB REST API. The following changes were introduced to ScyllaDB code to ensure it compiles with the updated seastar: - Remove `seastar/util/modules.hh` includes as the file was removed from seastar - Modified `metrics::impl::labels_type` construction in `test/boost/group0_test.cc` because now it requires `escaped_string` * seastar 340e14a7...8c3fba7a (32): > Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov dns: Optimize packet sending for newer c-ares versions dns: Replace net::packet with vector<temporary_buffer> dns: Remove unused local variable dns: Remove pointless for () loop wrapping dns: Introduce do_sendv_tcp() method dns: Introduce do_send_udp() method > test: Add http rules test of matching order > Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov memcached: Patch test to use memory_data_source memcached: Use memory_data_source in server rpc: Use memory_data_sink without constructing net::packet util: Generalize packet_data_source into memory_data_source > tests: coroutines: restore "explicit this" tests > reactor: remove blocking of SIGILL > Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov github: Use gcc-14 github: Use clang-20 > Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov test: Make test_resolve() try several addresses test: Coroutinize test_resolve() helper > modules: make module support standards-compliant > Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov dns: Squash two if blocks together dns: Do not check tcp entry for udp type > coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend > promise: Document that promise is resolved at most once > coroutine: exception: workaround broken 
destroy coroutine handle in await_suspend > socket: Return unspecified socket_address for unconnected socket > smp: Fix exception safety of invoke_on_... internal copying > Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov reactor: Keep loads timer on reactor reactor: Update loads evaluation loop > Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski test: extend tests/unit/api.json to use 'integer' type scripts: add 'integer' type to seastar-json2code > Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov tls: Rename do_put_one(temporary_buffer) into do_put() tls: Fix indentation after previous patch tls: Move semaphore grab into iterating do_put() > net: tcp: change unsent queue from packets to temporary_buffer:s > timer: Enable highres timer based on next timeout value > rpc: Add a new constructor in closed_error to accept string argument > memcache: Implement own data sink for responses > Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity file: do_recursive_remove_directory(): move object when popping from queue file: do_recursive_remove_directory(): adjust indentation file: do_recursive_remove_directory(): coroutinize file: do_recursive_remove_directory(): simplify conditional file: do_recursive_remove_directory(): remove wrong const file: do_recursive_remove_directory(): clean up work_entry > tests: Move thread_context_switch_test into perf/ > test: Add unit test for append_challenged_posix_file > Merge 'Prometheus metrics handler optimization' from Travis Downs prometheus: optimize metrics aggregation prometheus: move and test aggregate_by helper prometheus: various optimizations metrics: introduce escaped_string for label values metric:value: implement + in terms of += tests: add prometheus text format acceptance tests extract memory_data_sink.hh metrics_perf: enhance metrics bench > demos: Simplify udp_zero_copy_demo's way of preparing the packet > metrics: Remove 
deprecated make_...-ers > Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov test: Use BOOST_REQUIRE checkers test: Replace some SEASTAR_ASSERT-s with static_assert-s test: Convert slab test into boost kind > Merge 'Coroutinize lister_test' from Pavel Emelyanov test: Fix indentation after previuous patch test: Coroutinize lister_test lister::report() method test: Coroutinize lister_test main code > file: recursive_remove_directory(): use a list instead of a deque > Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov tls: Stop using net::packet in session::put() tls: Fix indentation after previous patch tls: Split session::do_put() tls: Mark some session methods private Closes scylladb/scylladb#27240	2025-11-27 12:34:22 +02:00
Nadav Har'El	32afcdbaf0	test/alternator: enable, and add, tests for gzip'ed requests After the previous patch implemented support in Alternator for gzip-compressed requests ("Content-Encoding: gzip"), here we enable an existing xfail-ing test for this feature, and also add more tests for more cases: * A test for longer compressed requests, or a short compressed request which expands to a longer request. Since the decompression uses small buffers, this test reaches additional code paths. * Check for various cases of a malformed gzip'ed request, and also an attempt to use an unsupported Content-Encoding. DynamoDB returns error 500 for both cases, so we want to test that we do too - and not silently ignore such errors. * Check that two concatenated gzip'ed streams are a valid request, and check that garbage at the end of the gzip - or a missing character at the end of the gzip - is recognized as an error. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-27 09:42:47 +02:00
Wojciech Mitros	323e5cd171	mv: allow setting concurrency in PRUNE MATERIALIZED VIEW The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Fixes https://github.com/scylladb/scylladb/issues/27070	2025-11-27 00:02:28 +01:00
Tomasz Grabiec	d6c14de380	Merge 'locator/node: include _excluded in missing places' from Patryk Jędrzejczak We currently ignore the `_excluded` field in `node::clone()` and the verbose formatter of `locator::node`. The first one is a bug that can have unpredictable consequences on the system. The second one can be a minor inconvenience during debugging. We fix both places in this PR. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72 This PR is a bugfix that should be backported to all supported branches. Closes scylladb/scylladb#27265 * github.com:scylladb/scylladb: locator/node: include _excluded in verbose formatter locator/node: preserve _excluded in clone()	2025-11-26 18:29:59 +01:00
Asias He	ab4896dc70	topology_coordinator: Send incremental repair rpc only when the feature is enabled Otherwise, in a mixed cluster, the handle_tablet_resize_finalization would fail because of the unknown rpc verb. Fixes #26309 Closes scylladb/scylladb#27218	2025-11-26 15:25:36 +01:00
Patryk Jędrzejczak	287c9eea65	locator/node: include _excluded in verbose formatter It can be helpful during debugging.	2025-11-26 13:26:17 +01:00
Patryk Jędrzejczak	4160ae94c1	locator/node: preserve _excluded in clone() We currently ignore the `_excluded` field in `clone()`. Losing information about exclusion can have unpredictable consequences. One observed effect (that led to finding this issue) is that the `/storage_service/nodes/excluded` API endpoint sometimes misses excluded nodes.	2025-11-26 13:26:11 +01:00
Patryk Jędrzejczak	cc273e867d	Merge 'fix notification about expiring erm held for to long' from Gleb Natapov Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should. Fixes https://github.com/scylladb/scylladb/issues/27141 Closes scylladb/scylladb#27140 * https://github.com/scylladb/scylladb: test: test that expired erm that held for too long triggers notification token_metadata: fix notification about expiring erm held for to long	2025-11-26 12:59:00 +01:00
Amnon Heiman	68c7236acb	vector_index: require tablets for vector indexes This patch enforces that vector indexes can only be created on keyspaces that use tablets. During index validation, `check_uses_tablets()` verifies the base keyspace configuration and rejects creation otherwise. To support this, the `custom_index::validate()` API now receives a `const data_dictionary::database&` parameter, allowing index implementations to access keyspace-level settings during DDL validation. Fixes https://scylladb.atlassian.net/browse/VECTOR-322 Closes scylladb/scylladb#26786	2025-11-26 13:30:43 +02:00
Marcin Maliszkiewicz	dd461e0472	auth: use auth cache on login path This path may become hot during connection storms, which is why we want it to stress the node as little as possible.	2025-11-26 12:01:33 +01:00
Marcin Maliszkiewicz	0c9b2e5332	auth: corutinize standard_role_manager::can_login Coroutinize so that it's easier to add new logic in the following commit.	2025-11-26 12:01:32 +01:00
Marcin Maliszkiewicz	b29c42adce	main: auth: add auth cache dependency to auth service In the following commit we'll switch some authorizer and role manager code to use the cache so we're preparing the dependency.	2025-11-26 12:01:31 +01:00
Marcin Maliszkiewicz	ea3dc0b0de	raft: update auth cache when data changes When applying group0_command we now inspect whether any auth internal tables were modified, and reload affected role entries in the cache. Since one auth DML may change multiple tables, when iterating over mutations we deduplicate affected roles across those tables.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	2a6bef96d6	auth: storage_service: reload auth cache on v1 to v2 auth migration	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	19da1cb656	raft: reload auth cache on snapshot application Receiving a snapshot is a rare event, so as a simplification we'll be reloading the whole cache instead of trying to merge states, especially since the expected size is small, below 100 records. Reloading is a non-disruptive operation: old entries are removed only after all entries are loaded. If an entry is updated, the shared pointer will be atomically replaced in the cache map.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	2cf1ca43b5	service: add auth cache getter to storage service Prepare for use in a subsequent commit in group0_state_machine, where the auth cache will be integrated. This follows the same pattern as updates to the service-level cache, view-building state, and CDC streams.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	642f468c59	main: start auth cache service The service is not yet used anywhere, we first build scaffolding.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	bd7c87731b	auth: add unified cache implementation It combines data from all underlying auth tables. Supports gentle full load and per role reloads. Loading is done on shard 0 and then deep copies data to all shards.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	4c667e87ec	auth: move table names to common.hh They will be used additionally in cache code, added in following commits.	2025-11-26 12:00:50 +01:00
Nadav Har'El	f4555be8a5	docs/alternator: list another unimplemented Alternator feature A new feature was announced this week for Amazon DynamoDB, "multi-attribute composite keys in global secondary indexes", which allows creating GSIs with composite keys (multiple columns). This feature already existed in CQL's materialized views, but didn't exist in DynamoDB until now. So this patch adds a paragraph to our docs/alternator/compatibility.md mentioning that we don't support this DynamoDB feature yet. See also issue #27182 which we opened to track this unimplemented feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27183	2025-11-26 12:10:37 +02:00
Pavel Emelyanov	943350fd35	scripts: Add target branch checking in PR merging script Sometimes (though rarely) I call this script on mis-matching PR and current branch. E.g. trying to merge master PR into stable next, or 2025.X PR into next-2025.Y (X != Y). Typically merge fails, but it's good to catch it early. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27249	2025-11-26 12:10:16 +02:00
Nadav Har'El	9cde93e3da	Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability. With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead. In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments. Fixes https://github.com/scylladb/scylladb/issues/26311 This patch needs to be backported to 2025.4. Closes scylladb/scylladb#26897 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: update the docs after recent changes db/view/view_building: send coordinator's term in the RPC db/view/view_building_state: replace task's state with `aborted` flag db/view/view_building_coordinator: batch finished tasks reporting db/view/view_building_worker: change internal implementation db/view/view_building_coordinator: change `work_on_tasks` RPC return type	2025-11-26 11:35:44 +02:00
dependabot[bot]	86cd0a4dce	build(deps): bump sphinx-multiversion-scylla in /docs Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.3 to 0.3.4. --- updated-dependencies: - dependency-name: sphinx-multiversion-scylla dependency-version: 0.3.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27214	2025-11-26 06:57:02 +02:00
tomek7667	9bbdd487b4	docs: insert.rst: Update insert example by removing 'year' column Closes scylladb/scylladb#26862	2025-11-26 06:55:28 +02:00
tomek7667	2138ab6b0e	docs: insert.rst: fix INSERT statement for NerdMovies example Closes scylladb/scylladb#26863	2025-11-26 06:53:45 +02:00
tomek7667	90a6aa8057	docs: ddl.rst: Fix formatting of null value note Closes scylladb/scylladb#26853	2025-11-26 06:52:18 +02:00
Botond Dénes	384bffb8da	Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho This PR adds support for limiting the maximum shares allocated to a compaction scheduling class by the compaction controller. It introduces a new configuration parameter, compaction_max_shares, which, when set to a non zero value, will cap the shares allocated to compaction jobs. This PR also exposes the shares computed by the compaction controller via metrics, for observability purposes. Fixes https://github.com/scylladb/scylladb/issues/9431 Enhancement. No need to backport. NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696 Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50) Closes scylladb/scylladb#27024 * github.com:scylladb/scylladb: db/config: introduce new config parameter `compaction_max_shares` compaction_manager:config: introduce max_shares compaction_controller: add configurable maximum shares compaction_controller: introduce `set_max_shares()`	2025-11-26 06:51:30 +02:00
Avi Kivity	1f6e3301e7	dist: systemd: drop deprecated CPU and I/O shares/weight from scylla-server.slice The BlockIOWeight and CPUShares are deprecated. They are only used on RHEL 7, which has reached end-of-life. Their replacements, IOWeight and CPUWeight, are already set in the file. Remove the deprecated settings to reduce noise in the logs. Closes scylladb/scylladb#27222	2025-11-26 06:42:11 +02:00
Yaniv Michael Kaul	765a7e9868	gms/gossiper.cc: fix gossip log to show host-id/ip instead of host-id/host-id Probably a copy-paste error, fixes the log to print host-id/ip. Backport: no need, benign log issue. Fixes: https://github.com/scylladb/scylladb/issues/27113 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27225	2025-11-25 20:56:20 +01:00
Wojciech Mitros	3c376d1b64	alternator: use storage_proxy from the correct shard in executor::delete_table When we delete a table in alternator, the schema change is performed on shard 0. However, we actually use the storage_proxy from the shard that is handling the delete_table command. This can lead to problems because some information is stored only on shard 0 and using storage_proxy from another shard may make us miss it. In this patch we fix this by using the storage_proxy from shard 0 instead. Fixes https://github.com/scylladb/scylladb/issues/27223 Closes scylladb/scylladb#27224	2025-11-25 18:56:31 +01:00
Botond Dénes	584f4e467e	tools/scylla-sstable: introduce the dump-schema command There is a limited number of ways to obtain the schema of a table: 1) Use DESCRIBE TABLE in cqlsh 2) Find the schema definition in the code (for system tables) 3) Ask support/user to provide schema 4) Piece together the schema definition from the system tables Option (1) is the most convenient but requires access to a live cluster. (2) is limited to system tables only. When investigating issues for customers, we have to rely on (3) and this often adds communication round-trips and delays. (4) requires knowledge of ScyllaDB internals and access to system tables. The new dump-schema command provides a convenient way to obtain the schema of tables, given that there is access to either an sstable or the system tables. It can dump the schema of system tables without either. Closes scylladb/scylladb#26433	2025-11-25 20:32:36 +03:00
Nadav Har'El	4c7c5f4af7	alternator: implement gzip-compressed requests In this patch we implement Alternator's support for gzip-compressed requests, i.e., requests with the "Content-Encoding: gzip" header, other uncompressed headers, and a gzip-compressed body. The server needs to verify the signature of the compressed content, and then uncompress the body before running the request. We only support gzip compression because this is what DynamoDB supports. But in the future we can easily add support for other compression algorithms like lz4 or zstd. This patch Refs #5041 but doesn't "Fixes" it because it only implements compressed requests (Content-Encoding), not compressed responses (Accept-Encoding). The next patch will enable several tests for this feature and make sure it behaves like DynamoDB. Note that while we will now have support in our server for compressed requests, just like DynamoDB does, the clients (AWS SDKs) will probably NOT make use of it because they do not enable request compression by default. For example, see the tests for some hoops one needs to jump through in boto3 (the Python SDK) to send compressed requests. However, we are hoping that in the future Alternator's modified clients will use compressed requests and enjoy this feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-25 17:46:44 +02:00
Gleb Natapov	5dcdaa6f66	test: test that expired erm that held for too long triggers notification	2025-11-25 17:33:54 +02:00
Piotr Dulikowski	ff5c7bd960	Merge 'topology_coordinator: don't repair colocated tablets' from Michael Litvak With the introduction of colocated tables, all the tablet transitions now operate on groups of colocated tablets instead of individual tablets, such as tablet migration and tablet repair. The tablet repair currently doesn't work on individual tablets due to the limitations in the tablet map being shared. The way it was implemented to work on a group of colocated tablets is by repairing all the colocated tablets together, using a dedicated rpc, and setting a shared repair_time in the shared tablet map. It was implemented this way because we wanted to have some way to repair the tablets of a colocated table. However, we want to change this in the next release so that it will be possible to repair the tablets of a colocated table individually. In order to simplify and prepare for the future change, we prefer until then to not repair colocated tables at all. Otherwise, we will need to support both the shared repair and individual repair together for a long time, and the upgrade will be more complicated. We change the handling of the tablet 'repair' transition to repair only the base table's tablets. It means it will not be possible to request tablet repair for a non-base colocated table such as local MV, CDC and paxos table. This restriction will be temporary until a later release where we will support repairing colocated tablets. This is a reasonable restriction because repair for these kinds of tables is not required or as important as for normal tables. 
Fixes https://github.com/scylladb/scylladb/issues/27119 backport to 2025.4 since we must change it in the same version it's introduced before it's released Closes scylladb/scylladb#27120 * github.com:scylladb/scylladb: tombstone_gc: don't use 'repair' mode for colocated tables Revert "storage service: add repair colocated tablets rpc" topology_coordinator: don't repair colocated tablets	2025-11-25 14:58:06 +01:00
David Garcia	64a65cac55	docs: add metrics generation validation fix: windows gitbash support fix: new name found with no group vector_search/vector_store_client.cc 343 fix: rm allowmismatch fix: git bash (windows) compatibility fix: git bash (windows) compatibility Closes scylladb/scylladb#26173	2025-11-25 15:39:52 +03:00
Gleb Natapov	9f97c376f1	token_metadata: fix notification about expiring erm held for too long Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix assign operator to call destructor.	2025-11-25 13:35:24 +02:00
Michał Jadwiszczak	fe9581f54c	docs/dev/view-building-coordinator: update the docs after recent changes Remove information about view building task state and explain the current lifetime of the task.	2025-11-25 12:14:05 +01:00
Michał Jadwiszczak	fb8cbf1615	db/view/view_building: send coordinator's term in the RPC To avoid case when an old coordinator (which hasn't been stopped yet) dictates what should be done, add raft term to the `work_on_view_building_tasks` RPC. The worker needs to check if the term matches the current term from raft server, and deny the request when the term is bad.	2025-11-25 12:14:05 +01:00
Michał Jadwiszczak	24d69b4005	db/view/view_building_state: replace task's state with `aborted` flag After previous commits, we can drop entire task's state and replace it with single boolean flag, which determines if a task was aborted. Once a task was aborted, it cannot get resurrected to a normal state.	2025-11-25 12:14:04 +01:00
Michał Jadwiszczak	eb04af5020	db/view/view_building_coordinator: batch finished tasks reporting In previous implementation to execute view building tasks, the coordinator needed to firstly set their states to `STARTED` and then it needed to remove them before it could start the next ones. This logic required a lot of group0 commits, especially in large clusters with higher number of nodes and big tablet count. After previous commit to the view building worker, the coordinator can start view building tasks without setting the `STARTED` state and deleting finished tasks. This patch adjusts the coordinator to save finished tasks locally, so it can continue to execute next ones and the finished tasks are periodically removed from the group0 by `finished_task_gc_fiber()`.	2025-11-25 12:14:04 +01:00
dependabot[bot]	b911a643fd	build(deps): bump sphinx-scylladb-theme from 1.8.8 to 1.8.9 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.8 to 1.8.9. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/commits) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.9 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27169	2025-11-25 11:01:37 +02:00
Botond Dénes	1263e1de54	Merge 'docs: modify debian/ubutnu installation instructions' from Yaron Kaikov To support debian13, we need to modify the installation instructions since `apt-key` command is no longer available Also updated installation instruction to match the latest release Fixes: https://github.com/scylladb/scylladb/issues/26673 No need for backport since we added debian13 only in master for now Closes scylladb/scylladb#27205 * github.com:scylladb/scylladb: install-on-linux.rst: update installation example to supported release docs: modify debian/ubutnu installation instructions	2025-11-25 10:53:11 +02:00
Nadav Har'El	bcd1758911	Merge 'vector_search: add validator tests' from Pawel Pery The vector-search-validator is a binary tool which does functional and integration tests between scylla and vector-store. It is built in Rust mainly in vector-store repository. This patch adds the possibility to write tests on scylladb repository side, compile them together with vector-store tests and run them in `test.py` environment. There are three parts of the change: - add sources of validator to the `test/vector_search_validator` directory - add support for building validator and vector-store in `build/vector-search-validator/bin` directory with or without cmake - add support for `pytest` and `test.py` to run validator test locally and in the CI environment; this part adds also README to the `test/vector_search_validator` directory Design for validator integration tests: https://scylladb.atlassian.net/wiki/spaces/RND/pages/39518215/Vector+Search+Core+Test+Plan+Document References: VECTOR-50 No backport needed as this is a new functionality. Closes scylladb/scylladb#26653 * github.com:scylladb/scylladb: vector_search: add vector-search-validator tests vector_search: implement building vector-search-validator vector_search: add vector-search-validator sources	2025-11-25 10:34:33 +02:00
Michael Litvak	868ac42a8b	tombstone_gc: don't use 'repair' mode for colocated tables For tables of special types that can be colocated: MV, CDC, and paxos table, we should not use tombstone_gc=repair mode because colocated tablets are never repaired, hence they will not have repair_time set and will never be GC'd using 'repair' mode.	2025-11-25 09:15:46 +01:00
Michael Litvak	005807ebb8	Revert "storage service: add repair colocated tablets rpc" This reverts commit `11f045bb7c`. The rpc was added together with colocated tablets in 2025.4 to support a "shared repair" operation of a group of colocated tablets that repairs all of them and allows also for special behavior as opposed to repairing a single specific tablet. It is not used anymore because we decided to not repair all colocated tablets in a single shared operation, but to repair only the base table, and in a later release support repairing colocated tables individually. We can remove the rpc in 2025.4 because it is introduced in the same version.	2025-11-25 09:06:48 +01:00
Michael Litvak	273f664496	topology_coordinator: don't repair colocated tablets With the introduction of colocated tables, all the tablet transitions now operate on groups of colocated tablets instead of individual tablets, such as tablet migration and tablet repair. The tablet repair currently doesn't work on individual tablets due to the limitations in the tablet map being shared. The way it was implemented to work on a group of colocated tablets is by repairing all the colocated tablets together, using a dedicated rpc, and setting a shared repair_time in the shared tablet map. It was implemented this way because we wanted to have some way to repair the tablets of a colocated table. However, we want to change this in the next release so that it will be possible to repair the tablets of a colocated table individually. In order to simplify and prepare for the future change, we prefer until then to not repair colocated tables at all. Otherwise, we will need to support both the shared repair and individual repair together for a long time, and the upgrade will be more complicated. We change the handling of the tablet 'repair' transition to repair only the base table's tablets. It means it will not be possible to request tablet repair for a non-base colocated table such as local MV, CDC and paxos table. This restriction will be temporary until a later release where we will support repairing colocated tablets. This is a reasonable restriction because repair for these kinds of tables is not required or as important as for normal tables. Fixes scylladb/scylladb#27119	2025-11-25 09:05:59 +01:00
Amnon Heiman	b2c2a99741	index/vector_index.cc: Don't allow zero as an index option This patch forces vector_index option values to be real-positive numbers, as zero would make no sense. Fixes https://scylladb.atlassian.net/browse/VECTOR-249 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#27191	2025-11-25 10:05:44 +02:00
Karol Nowacki	ca62effdd2	vector_search: Restrict vector index tests to tablets only Vector indexes are going to be supported only for tablets (see VECTOR-322). As a result, tests using vector indexes will be failing when run with vnodes. This change ensures tests using vector indexes run exclusively with tablets. Fixes: VECTOR-49 Closes scylladb/scylladb#26843	2025-11-25 09:26:16 +02:00
Pawel Pery	9f10aebc66	vector_search: add vector-search-validator tests The commit adds a functionality for `pytest` and `test.py` to run `vector-search-validator` in `sudo unshare` environment. There are already two tests - first parametrized `test_validator.py::test_validator[test-case-name]` (run validator) and second `test_cargo_toml.py::test_cargo_toml` (check if the current `Cargo.toml` for validator is correct). Documentation for these tests is provided in `README.md`.	2025-11-24 17:26:04 +01:00
Pawel Pery	3702e982b9	vector_search: implement building vector-search-validator The commit adds targets building `build/vector-search-validator/bin/{vector-store,vector-search-validator}`. The targets must be built for tests. They don't depend on build mode. The commit adds target in `configure.py` and also in `cmake`.	2025-11-24 17:26:04 +01:00
Pawel Pery	e569a04785	vector_search: add vector-search-validator sources The commit adds validator sources, which use a combination of local files and vector-store's files. In `build-env` there is the definition of the vector-store git repository and the revision on which the validator will be built. `cargo-toml-template` is a script for printing the current `Cargo.toml` to stdout. After updating `build-env`, the developer needs to update the configuration with `./cargo-toml-template > Cargo.toml`. The git revision is used in several places in `Cargo.toml` and will be used for building `vector-store`, so for better handling the git revision should be set up in only one place. The validator is divided into several crates to be able to build it within scylladb and vector-store repositories. Here we need to create a new validator crate with simple `main` function and call `validator_engine::main` there. We provide tests written in scylladb repo in `validator-scylla` crate. The commit provides empty `cql` test case, which should be filled in the future.	2025-11-24 17:26:04 +01:00
Gleb Natapov	39cec4ae45	topology: let banned node know that it is banned Currently if a banned node tries to connect to a cluster it fails to create connections, but has no idea why, so from inside the node it looks like it has communication problems. This patch adds new rpc NOTIFY_BANNED which is sent back to the node when its connection is dropped. On receiving the rpc the node isolates itself and prints an informative message about why it did so. Closes scylladb/scylladb#26943	2025-11-24 17:12:13 +01:00
Lakshmi Narayanan Sreethar	9cb766f929	db/config: introduce new config parameter `compaction_max_shares` Add support for the new configuration parameter `compaction_max_shares`, and update the compaction manager to pass it down to the compaction controller when it changes. The shares allocated to compaction jobs will be limited by this new parameter. Fixes #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 12:52:29 -03:00
Lakshmi Narayanan Sreethar	468b800e89	compaction_manager:config: introduce max_shares Introduce an updateable value `max_shares` to compaction manager's config. Also add a method `update_max_shares()` that applies the latest `max_shares` value to the compaction controller’s `max_shares`. This new variable will be connected to a config parameter in the next patch. Refs #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 11:43:38 -03:00
Lakshmi Narayanan Sreethar	f2b0489d8c	compaction_controller: add configurable maximum shares Add a `max_shares` constructor parameter to compaction_controller to allow configuring the maximum output of the control points at construction time. The constructor now calls `set_max_shares()` with the provided max_shares value. The subsequent commits will wire this value to a new configuration option. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 11:43:24 -03:00
Lakshmi Narayanan Sreethar	853811be90	compaction_controller: introduce `set_max_shares()` Add a method to dynamically adjust the maximum output of control points in the compaction controller. This is required for supporting runtime configuration of the maximum shares allocated to the compaction process by the controller. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-11-24 11:43:20 -03:00
Tomasz Grabiec	d4b77c422f	Merge 'load_stats: leaving replica could be std::nullopt' from Ferenc Szili When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty. This bug was introduced in `10f07fb95a` This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size by averaging the tablet sizes of the existing replicas. This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster. A version containing this bug has not yet been released, so a backport is not needed. Closes scylladb/scylladb#27118 * github.com:scylladb/scylladb: test: add tests for tablet size migration during end_migration virtual_table: add tablet_sizes virtual table load_stats: update tablet sizes after migration or rebuild	2025-11-24 15:31:30 +01:00
Yaron Kaikov	13eca61d41	install-on-linux.rst: update installation example to supported release The installation example is out of date, since scylla-5.2 has been EOL for a long time. Update the example to a more recent release (together with the packages update).	2025-11-24 16:22:17 +02:00
Anna Stuchlik	724dc1e582	doc: fix the info about object storage This commit fixes the information about object storage: - Object storage configuration is no longer marked as experimental. - Redundant information has been removed from the description. - Information related to object storage for SSTables has been removed as the feature is not working. Fixes https://github.com/scylladb/scylladb/issues/26985 Closes scylladb/scylladb#26987	2025-11-24 17:16:33 +03:00
Yaron Kaikov	5541f75405	docs: modify debian/ubutnu installation instructions To support debian13, we need to modify the installation instructions since `apt-key` command is no longer available Fixes: https://github.com/scylladb/scylladb/issues/26673	2025-11-24 13:33:17 +02:00
Michał Jadwiszczak	08974e1d50	db/view/view_building_worker: change internal implementation This commit doesn't change the logic behind the view building worker but it changes how the worker is executing view building tasks. Previously, the worker had a state only on shard0 and it was reacting to changes in group0 state. When it noticed some tasks were moved to `STARTED` state, the worker was creating a batch for it on the shard0 state. The RPC call was used only to start the batch and to get its result. Now, the main logic of batch management was moved to the RPC call handler. The worker has a local state on each shard and the state contains: - unique ptr to the batch - set of completed tasks - information for which views the base table was flushed So currently, each batch lives on a shard where it has its work to do exclusively. This eliminates a need to do a synchronization between shard0 and work shard, which was a painful point in previous implementation. The worker still reacts to changes in group0 view building state, but currently it's only used to observe whether any view building task was aborted by setting `ABORTED` state. To prepare for further changes to drop the view building task state, the worker ignores `IDLE` and `STARTED` states completely.	2025-11-24 11:12:31 +01:00
Michał Jadwiszczak	6d853c8f11	db/view/view_building_coordinator: change `work_on_tasks` RPC return type During the initial implementation of the view building coordinator, we decided that if a view building task fails locally on the worker (example reason: view update's target replica is not available), the worker will retry this work instead of reporting a failure to the coordinator. However, we left return type of the RPC, which was telling if a task was finished successfully or aborted. But the worker doesn't need to report that a task was aborted, because it's the coordinator, who decides to abort a task. So, this commit changes the return type to list of UUIDs of completed tasks. Previously length of the returned vector needed to be the same as length of the vector sent in the request. Now we can drop this restriction and the RPC handler returns a list of UUIDs of completed tasks (subset of vector sent in the request). This change is required to drop `STARTED` state in next commits. Since Scylla 2025.4 wasn't released yet and we're going to merge this patch before releasing, no RPC versioning or cluster feature is needed.	2025-11-24 11:12:29 +01:00
Avi Kivity	eb5e9f728c	build: lock cxxbridge-cmd version to the rest of the cxx packages rust/Cargo.toml locks the cxx packages to version 1.0.83, but install-dependencies.sh does not lock cxxbridge-cmd, part of that ecosystem. Since cxx 1.0.189 broke compatibility with 1.0.83 (understandable, as these are all sub-packages of a single repository), builds with newer cxxbridge-cmd are broken. Fix by locking cxxbridge-cmd to the same version as the other cxx subpackages. Regenerated frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz Probably better done by building cxxbridge-cmd during the build itself, but that is a deeper change. Fixes #27176 Closes scylladb/scylladb#27177	2025-11-24 07:04:53 +02:00
Avi Kivity	d6ef5967ef	tools: toolchain: prepare: replace 'reg' with 'skopeo' The prepare script uses 'reg' to verify we're not going to overwrite an existing image. The 'reg' command is not available in Fedora 43. Use 'skopeo' instead. Skopeo is part of the podman ecosystem so hopefully will live longer. Fixes #27178. Closes scylladb/scylladb#27179	2025-11-24 06:59:34 +02:00
Aleksandra Martyniuk	19a7d8e248	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there are a lot of tables, a node reports oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165	2025-11-24 06:42:40 +02:00
Botond Dénes	296d7b8595	Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest. New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches. Backport is not required, since it is a new feature Fixes #21776 Closes scylladb/scylladb#26702 * github.com:scylladb/scylladb: file_stream_test: add sstable file streaming integrity verification test cases streaming: prioritize sender-side errors in tablet_stream_files sstables: enable integrity check for data file streaming sstables: Add compressed raw streaming support sstables: Allow to read digest and checksum from user provided file instance sstables: add overload of data_stream() to accept custom file_input_stream_options	2025-11-24 06:37:27 +02:00
Aleksandra Martyniuk	76174d1f7a	cql3: reject ALTER KEYSPACE if rf of datacenter with tablets is omitted In ALTER KEYSPACE, when a datacenter name is omitted, its replication factor is implicitly set to zero with vnodes, while with tablets, it remains unchanged. ALTER KEYSPACE should behave the same way for tablets as it does for vnodes. However, this can be dangerous as we may mistakenly drop the whole datacenter. Reject ALTER KEYSPACE if it changes replication factor, but omits a datacenter that currently contains tablet replicas. Fixes: https://github.com/scylladb/scylladb/issues/25549. Closes scylladb/scylladb#25731	2025-11-24 06:36:51 +02:00
Avi Kivity	85db7b1caf	Merge 'address_map: Use more efficient and reliable replication method' from Tomasz Grabiec Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated, because we update mapping on each change of gossip states. This made bootstrap impossible because nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable, if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Closes scylladb/scylladb#26941 * github.com:scylladb/scylladb: address_map: Use barrier() to wait for replication address_map: Use more efficient and reliable replication method utils: Introduce helper for replicated data structures	2025-11-23 19:15:12 +02:00
Avi Kivity	b0643f8959	Merge 'db/config: enable `ms` sstable format by default' from Michał Chojnowski Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make them the new default. If we change our mind, this change can be reverted later. New functionality, and this is a drastic change. No backport needed. Closes scylladb/scylladb#26377 * github.com:scylladb/scylladb: db/config: enable `ms` sstable format by default cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format api/system: add /system/chosen_sstable_version test/cluster/dtest: reduce num_tokens to 16	2025-11-23 13:52:57 +02:00
Piotr Dulikowski	e8b0f8faa9	Merge 'vector search: Add HTTPS requests support' from Karol Nowacki vector_search: Add HTTPS support for vector store connections This commit introduces TLS encryption support for vector store connections. A new configuration option is added: - vector_store_encryption_options.truststore: path to the trust store file To enable secure connections, use the https:// scheme in the vector_store_primary_uri/vector_store_secondary_uri configuration options. Fixes: VECTOR-327 Backport to 2025.4 as this feature is expected to be available in 2025.4. Closes scylladb/scylladb#26935 * github.com:scylladb/scylladb: test: vector_search: Ensure all clients are stopped on shutdown vector_search: Add HTTPS support for vector store connections	2025-11-22 14:58:06 +01:00
Karol Nowacki	58456455e3	test: vector_search: Ensure all clients are stopped on shutdown A flaky test revealed that after `clients::stop()` was called, the `old_clients` collection was sometimes not empty, indicating that some clients were not being stopped correctly. This resulted in sanitizer errors when objects went out of scope at the end of the test. This patch modifies `stop()` to ensure all clients, including those in `old_clients`, are stopped, guaranteeing a clean shutdown.	2025-11-22 08:18:45 +01:00
Karol Nowacki	c40b3ba4b3	vector_search: Add HTTPS support for vector store connections This commit introduces TLS encryption support for vector store connections. A new configuration option is added: - vector_store_encryption_options.truststore: path to the trust store file To enable secure connections, use the https:// scheme in the vector_store_primary_uri/vector_store_secondary_uri configuration options. Fixes: VECTOR-327	2025-11-22 08:18:45 +01:00
Ferenc Szili	39711920eb	test: add tests for tablet size migration during end_migration This change adds tests for the correctness of tablet size migration during the end_migration stage. This size migration can happen for tablet migrations and for tablet rebuild.	2025-11-21 16:58:11 +01:00
Ferenc Szili	e96863be0c	virtual_table: add tablet_sizes virtual table This change adds the tablet_sizes virtual table. The contents of this table are gathered from the current load_stats data structure.	2025-11-21 16:53:28 +01:00
Ferenc Szili	cede4f66af	load_stats: update tablet sizes after migration or rebuild When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are dereferenced without checking that they contain a value, which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither is empty. This bug was introduced in `10f07fb95a` This change also adds the functionality to add the tablet size to load_stats after a tablet rebuild. We compute the average tablet size from the existing replicas, and add the new size to the pending replica.	2025-11-21 16:22:20 +01:00
Botond Dénes	38a1b1032a	Merge 'doc: update Cloud Instance Recommendations for GCP' from Anna Stuchlik This PR: - Removes n1-highmem instances from Recommended Instances. - Adds missing support for n2-highmem-96. - Updates the reference to n2 instances in the Google Cloud docs (fixes a broken link to GCP). - Adds the missing information about processors for n2-highmem-instance - Ice Lake and Cascade Lake (requested by CX). Fixes https://github.com/scylladb/scylladb/issues/25946 Fixes https://github.com/scylladb/scylladb/issues/24223 Fixes https://github.com/scylladb/scylladb/issues/23976 No backport needed if this PR is merged before 2025.4 branching. Closes scylladb/scylladb#26182 * github.com:scylladb/scylladb: doc: update information for n2-highmem instances doc: remove n1-highmem instances from Recommended Instances	2025-11-21 16:28:54 +02:00
Anna Stuchlik	dab74471cc	doc: update information for n2-highmem instances This commit updates the section for n2-highmem instances on the Cloud Instance Recommendations page - Added missing support for n2-highmem-96 - Update the reference to n2 instances in the Google Cloud docs. - Added the missing information about processors for this instance type (Ice Lake and Cascade Lake).	2025-11-21 15:13:36 +01:00
Taras Veretilnyk	3003669c96	file_stream_test: add sstable file streaming integrity verification test cases Add 'test_sstable_stream' to verify SSTable file streaming integrity check. The new tests cover both compressed and uncompressed SSTables and include: - Checksum mismatch detection verification - Digest mismatch detection verification	2025-11-21 12:52:35 +01:00
Taras Veretilnyk	77dcad9484	streaming: prioritize sender-side errors in tablet_stream_files When 'send_data_to_peer' throws and closes the sink, the peer later reports its own error, masking the original sender failure. This commit preserves the original sender exception. If the status-retrieval task throws its own error before sender task rethrows its exception, we can still propagate the original exception later.	2025-11-21 12:52:31 +01:00
Taras Veretilnyk	c8d2f89de7	sstables: enable integrity check for data file streaming This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables, where checksums are already embedded, the cost comes from reading, calculating, and verifying the digest.	2025-11-21 12:52:26 +01:00
Taras Veretilnyk	18e1dbd42e	sstables: Add compressed raw streaming support Implement compressed_raw_file_data_source that streams compressed chunks without decompression while verifying checksums and calculating digests. Extends raw_stream enum to support compressed_chunks mode. This data_source implementation will be used in the next commits for file based streaming.	2025-11-21 12:52:04 +01:00
Taras Veretilnyk	c32e9e1b54	sstables: Allow to read digest and checksum from user provided file instance Add overloaded methods to read digest and checksum from user-provided file handles: - 'read_digest(file f)' - 'read_checksum(file f)' This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.	2025-11-21 12:51:40 +01:00
Michał Chojnowski	da51a30780	db/config: enable `ms` sstable format by default Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make them the new default. If we change our mind, this change can be reverted later.	2025-11-21 12:39:46 +01:00
Michał Chojnowski	73090c0d27	cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format Trie-based indexes and older indexes have a difference in metrics, and the test uses the metrics to check for bypass cache. To choose the right metrics, it uses highest_supported_sstable_format, which is inappropriate, because the sstable format chosen for writes by Scylla might be different than highest_supported_sstable_format. Use chosen_sstable_format instead.	2025-11-21 12:39:46 +01:00
Michał Chojnowski	38e14d9cd5	api/system: add /system/chosen_sstable_version Returns the sstable version currently chosen for new sstables. We are adding it because some tests want to know what format they are writing (tests using upgradesstable, tests which check stats that only apply to one of the index types, etc). (Currently they are using `highest_supported_sstable_format` for this purpose, which is inappropriate, and will become invalid if a non-latest format is the default).	2025-11-21 12:39:46 +01:00
Wojciech Mitros	aacf883a8b	cql: add CONCURRENCY to the USING clause Currently, the PRUNE MATERIALIZED VIEW statement performs all its reads and writes in a single, continuous sequence. This takes too much time even for a moderate amount of 'PRUNED' data. Instead, we want to make it possible to set a concurrency of the reads and writes performed while processing the PRUNE statement, so that if the user so desires, it may finish the PRUNING quicker at the cost of adding more load on the cluster. In this patch we add the CONCURRENCY setting to the USING clause in cql. In the next patch, we'll be using it to actually set the concurrency of PRUNE MATERIALIZED VIEW.	2025-11-21 12:32:52 +01:00
Botond Dénes	5c6813ccd0	test/cluster/test_repair.py: add test_repair_timestamp_difference Add a test which verifies that if two nodes have the same data, with different timestamps, repair will detect and fix the diverging timestamps. All our repair tests focus on difference in data and I remember writing this test multiple times in the past to quickly verify whether this works. Time to upstream this test. Closes scylladb/scylladb#26900	2025-11-21 14:19:51 +03:00
Botond Dénes	6f79fcf4d5	tools/scylla-nodetool: dump request history on json assert A JSON assert happens when a JSON member is either missing or has unexpected type. rapidjson has a very unhelpful "json assert failed" message for this, with a backtrace (somewhat helpful), with no other context. To help debug such errors, collect all requests sent to the API and dump them when such errors happen. The backtrace with the full request history should be enough to debug any such issues. Refs CUSTOMER-17 Closes scylladb/scylladb#26899	2025-11-21 14:17:53 +03:00
Gautam Menghani	939fcc0603	db/system_keyspace: Remove the FIXME related to caching of large tables Remove the FIXME comment for re-enabling caching of the large tables since the tables are used infrequently [1]. [1] : github.com/scylladb/scylladb/pull/26789#issuecomment-3477540364 Fixes #26032 Signed-off-by: Gautam Menghani <gautam.opensource@gmail.com> Closes scylladb/scylladb#26789	2025-11-21 12:34:34 +02:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Nadav Har'El	64a075533b	alternator: fix update of stats from wrong shard In commit `51186b2` (PR #25457) we introduced new statistics for authentication errors, and among other places we modified executor::create_table() to update them when necessary. This function runs its real work (create_table_on_shard0()) on shard 0, but incorrectly updates "_stats" from the original shard. It doesn't really matter which shard's stats we update - but it does matter that code running on shard 0 shouldn't touch some other shard's objects. Since all we do on these stats is to increment an integer, the risk of updating it on the wrong shard is minimal to non-existent, but it's still wrong and can cause bigger trouble in the future as the code continues to evolve. The fix is simple - we should pass to create_table_on_shard0() the _stats object from the actual shard running it (shard 0). Fixes #26942 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26944	2025-11-21 11:53:06 +02:00
Calle Wilund	3c4546d839	messaging_service: Add internode_compression=rack as option Fixes #27085 Adds a "rack" option to enum/config and handles in connection setup in messaging_service. Closes scylladb/scylladb#27099	2025-11-21 11:50:55 +02:00
Nadav Har'El	66bd3dc22c	test/alternator: tests for request compression DynamoDB's documentation https://docs.aws.amazon.com/sdkref/latest/guide/feature-compression.html suggests that DynamoDB allows request bodies to be compressed (currently only by gzip). The purpose of this patch is to have a test reproducing this feature. The test shows us that indeed DynamoDB understands compressed requests using the "gzip" encoding, but Alternator does not, so the new test is xfail. As you can see in the test code, although the low-level SDK (botocore) can send compressed requests, this is not actually enabled for DynamoDB and we need to resort to some trickery to send compressed requests. But the point is that once we do manage to send compressed requests, the test shows us that they work properly on AWS, but fail on Alternator. The failure of the compressed requests on Alternator is reported like: An error occurred (ValidationException) when calling the PutItem operation: Parsing JSON failed: Invalid value. at 70459088 This error message should probably be improved (what is that high number?!) but of course even better would be to make it really work. By enabling tracing on alternator-server (e.g., edit test/cqlpy/run.py and add `'--logger-log-level', 'alternator-server=trace',`) we can see exactly what request the SDK sends Alternator. What we can see in the request is: 1. The request headers are uncompressed (this is expected in HTTP) 2. There is a header "Content-Encoding: gzip" 3. The request's body is binary, a full-fledged gzip output complete with a gzip magic in the beginning. Refs #5041 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27049	2025-11-21 10:48:33 +02:00
Shreyas Ganesh	4488a4fb06	docs: document sstables quarantine subdirectory Add documentation for the quarantine/ subdirectory that holds SSTables isolated due to validation failures or corruption. Document the scrub operation's quarantine_mode parameter options and the drop_quarantined_sstables API operation. Also update the directory hierarchy example to include the quarantine directory. Fixes #10742 Signed-off-by: Shreyas Ganesh <vansi.ganeshs@gmail.com> Closes scylladb/scylladb#27023	2025-11-21 10:45:33 +02:00
Ernest Zaslavsky	825d81dde2	cmake: dont complain about deprecated builtins On clang 21.1.4 (Fedora 43) the abseil compilation started to fail with `builtin XXX is deprecated use YYY instead`. Suppress this for abseil compilation only Closes scylladb/scylladb#27098	2025-11-21 10:31:54 +02:00
Botond Dénes	0cc5208f8e	Merge 'Add sstables_manager::config' from Pavel Emelyanov Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well. Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls. Shuffling components dependencies, no need to backport Closes scylladb/scylladb#27021 * github.com:scylladb/scylladb: sstables_manager: Drop db::config from sstables_manager tools/sstable: Make shard_of_with_tablets use db::config argument tools/sstable: Add db::config& to all operations tools/sstable: Get endpoints from storage manager sstables_manager: Hold sstable IO extensions on it sstables: Manager helper to grab file io extensions sstables_manager: Move default format on config sstables_manager: Move enable_sstable_data_integrity_check on config sstables_manager: Move data_file_directories on config sstables_manager: Move components_memory_reclaim_threshold on config sstables_manager: Move column_index_auto_scale_threshold on config sstables_manager: Move column_index_size on config sstables_manager: Move sstable_summary_ratio on config sstables_manager: Move enable_sstable_key_validation on config sstables_manager: Move available_memory on config code: Introduce sstables_manager::config sstables: Patch get_local_directories() to work on vector of paths code: Rename sstables_manager::config() into db_config()	2025-11-21 10:21:41 +02:00
Botond Dénes	f89bb68fe2	Merge 'cdc: Preserve properties when reattaching log table' from Dawid Mędrek When we enable CDC on a table, Scylla creates a log table for it. It has default properties, but the user may change them later on. Furthermore, it's possible to detach that log table by simply disabling CDC on the base table: ```cql /* Create a table with CDC enabled. The log table is created. */ CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true}; /* Detach the log table. */ ALTER TABLE ks.t WITH cdc = {'enabled': false}; /* Modify a property of the log table. */ ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13; ``` The log table can also be reattached by enabling CDC on the base table again: ```cql /* Reattach the log table */ ALTER TABLE ks.t WITH cdc = {'enabled': true}; ``` However, because the process of reattachment goes through the same code that created it in the first place, the properties of the log table are rolled back to their default values. This may be confusing to the user and, if unnoticed, also have other consequences, e.g. affecting performance. To prevent that, we ensure that the properties are preserved. A reproducer test, `test_log_table_preserves_properties_after_reattachment`, has been provided to verify that the changes are correct. Another test, `test_log_table_preserves_id_after_reattachment`, has also been added because the current implementation sets properties and the ID separately. Fixes scylladb/scylladb#25523 Backport: not necessary. Although the behavior may be unexpected, it's not a bug per se. 
Closes scylladb/scylladb#26443 * github.com:scylladb/scylladb: cdc: Preserve properties when reattaching log table cdc: Extract creating columns in CDC log table to dedicated function cdc: Extract default properties of CDC log tables to dedicated function schema/schema_builder.hh: Add set_properties schema: Add getter for schema::user_properties schema: Remove underscores in fields of schema::user_properties schema: Extract user properties out of raw_schema	2025-11-21 10:06:05 +02:00
Calle Wilund	03408b185e	utils::gcp::object_storage: Fix buffer alignment reordering trailing data Fixes #26874 Due to certain people (me) not being able to tell forward from backward, the data alignment to ensure partial uploads adhere to the 256k-align rule would potentially _reorder_ trailing buffers generated iff the source buffers input into the sink are small enough. Which, as a fun fact, they are in backup upload. Change the unit test to use raw sink IO and add two unit tests (of which the smaller size provokes the bug) that checks the same 64k buf segmented upload backup uses. Closes scylladb/scylladb#26938	2025-11-21 09:36:13 +02:00
Radosław Cybulski	ce8db6e19e	Add table name to tracing in alternator Add a table name to Alternator's tracing output, as some clients would like to consistently receive this information. - add missing `tracing::add_table_name` in `executor::scan` - add emitting tables' names in `trace_state::build_parameters_map` - update tests, so when tracing is looked for it is filtered by table's name, which confirms the table name is being output. - change `struct one_session_records` declaration to `class one_session_records`, as `one_session_records` is later defined as class. Refs #26618 Fixes #24031 Closes scylladb/scylladb#26634	2025-11-21 09:33:40 +02:00
Michał Chojnowski	3f11a5ed8c	test/cluster/dtest: reduce num_tokens to 16 cluster.dtest_alternator_tests.test_slow_query_logging performs a bootstrap with 768 token ranges. It works with `me` sstables, which have 2 open file descriptors per open sstable, but with `ms` sstables, which have 3 open file descriptors per open sstable, it fails with EMFILE. To avoid this problem, let's just decrease the number of vnodes in the test suite. It's appropriate anyway, because it avoids some unneeded work without weakening the tests. (Note: pylib-based tests have been setting `num_tokens` to 16 for a long time too). This breaks `bypass_cache_test`, which is written in a way that expects a certain number of token ranges. We adjust the relevant parameter accordingly.	2025-11-21 00:38:50 +01:00
Piotr Dulikowski	22f22d183f	Merge 'Refine sstables_loader vs database dependency' from Pavel Emelyanov There are two issues with it. First, a small RPC helper struct carries database reference on board just to get feature state from it. Second, sstable_streamer uses database as proxy to feature service. This PR improves both. Services dependencies improvement, not need to backport Closes scylladb/scylladb#26989 * github.com:scylladb/scylladb: sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging sstables_loader: Keep bool on send_meta, not database reference	2025-11-20 16:13:16 +01:00
Asias He	d51b1fea94	tablets: Allow tablet merge when repair tasks exist Currently we do not allow tablet merge if either of the tablets contains a tablet repair request. This could block the tablet merge for a very long time if the repair requests could not be scheduled and executed. We can actually merge the repair tasks in most of the cases. This is because most of the time all tablets are requested to be repaired by a single API request, so they share the same task_id, request_type and other parameters. We can merge the repair task info and execute the repair after the merge. If they do not share the task info, we cannot merge and have to wait for the repair before merge, which is both rare and ok. Another case is that one of the tablets has a repair task info (t1) while the other tablet (t2) does not; it is possible that t2 has finished repair by the same repair request or t2 is not requested to be repaired at all. We allow merge in this case too to avoid blocking the tablet merge, at the price of repairing a bit more. Fixes #26844 Closes scylladb/scylladb#26922	2025-11-20 16:01:23 +01:00
Asias He	3cf1225ae6	docs: Add feature page for incremental repair Adds a new documentation page for the incremental repair feature. The page covers: - What incremental repair is and its benefits over the standard repair process. - How it works at a high level by tracking the repair status of SSTables. - The prerequisite of using the tablets architecture. - The different user-configurable modes: 'regular', 'full', and 'disabled'. Fixes #25600 Closes scylladb/scylladb#26221	2025-11-20 11:58:53 +02:00
Raphael S. Carvalho	74ecedfb5c	replica: Fail timed-out single-key read on cleaned up tablet replica Consider the following: 1) single-key read starts, blocks on replica e.g. waiting for memory. 2) the same replica is migrated away 3) single-key read expires, coordinator abandons it, releases erm. 4) migration advances to cleanup stage, barrier doesn't wait on timed-out read 5) compaction group of the replica is deallocated on cleanup 6) that single-key read resumes, but doesn't find sstable set (post cleanup) 7) with abort-on-internal-error turned on, node crashes It's fine for abandoned (= timed out) reads to fail, since the coordinator is gone. For active reads (non timed out), the barrier will wait for them since their coordinator holds erm. This solution consists of failing reads whose underlying tablet replica has been cleaned up, by just converting internal error to plain exception. Fixes #26229. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#27078	2025-11-20 11:44:03 +02:00
Geoff Montee	a0734b8605	Update update-topology-strategy-from-simple-to-network.rst: Multiple clarifications to page and sub-procedures Fixes #27077 Multiple points can be clarified relating to: * Names of each sub-procedure could be clearer * Requirements of each sub-procedure could be clearer * Clarify which keyspaces are relevant and how to check them * Fix typos in keyspace name Closes scylladb/scylladb#26855	2025-11-20 11:33:15 +02:00
Patryk Jędrzejczak	45ad93a52c	topology_coordinator: include all transitioning nodes in all global commands This change makes the code simpler and less vulnerable to regressions. There is no functional impact because: - we already include a decommissioning/bootstrapping/replacing node for `barrier` and `barrier_and_drain`, - we never execute global commands in the presence of a rebuilding node, - removing node always belongs to `exclude_nodes`, so it's filtered out anyway, - we execute global `stream_ranges` only for removenode, - we execute global `wait_for_ip` only for new nodes when there are no transitioning nodes. Fixes #20272 Fixes #27066 Closes scylladb/scylladb#27102	2025-11-20 11:11:32 +02:00
dependabot[bot]	2ca926f669	build(deps): bump sphinx-multiversion-scylla in /docs Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.2 to 0.3.3. --- updated-dependencies: - dependency-name: sphinx-multiversion-scylla dependency-version: 0.3.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27081	2025-11-20 10:28:34 +02:00
Gleb Natapov	ad3cf2c174	utils: fix get_random_time_UUID_from_micros to generate correct time uuid According to the IETF spec uuid variant bits should be set to '10'. All others are either invalid or reserved. The patch changes the code to follow the spec. Closes scylladb/scylladb#27073	2025-11-20 10:27:29 +02:00
Avi Kivity	5d761373c2	Update tools/cqlsh submodule * tools/cqlsh 19445a5...2240122 (3): > copyutil: adjust multiprocessing method to 'fork' > Drop serverless/cloudconf feature > Migrate workflows to Blacksmith Closes scylladb/scylladb#27065	2025-11-20 10:26:43 +02:00
Taras Veretilnyk	e5fbe3d217	docs: improve documentation of the scrub Update nodetool scrub documentation to include --quarantine-mode and --drop-unfixable-sstables options, add a section explaining quarantine modes and provide examples and procedures for handling and removing corrupted SSTables. Closes scylladb/scylladb#27018	2025-11-20 10:26:07 +02:00
Nadav Har'El	a9cf7d08da	test/cqlpy: remove USE from test, and test a USE-related bug One of the tests in test_describe.py used "USE {test_keyspace}" which affects the CQL session shared by all tests in an unrecoverable way (there is no "UNUSE" statement). As an example of what might happen if the shared CQL session is "polluted" by a USE, issue #26334 is about a bug we have in DESC KEYSPACES when USE is active. So in this patch, we: 1. Fix the test to not use USE on the shared CQL session - it's easy to create a separate session to use the "USE" on. With this fix, the test no longer leaves the shared CQL session in a "USE" state. 2. Add a new xfailing test to reproduce the DESC KEYSPACES bug. Refs #26334 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26345	2025-11-20 10:25:03 +02:00
Botond Dénes	a084094c18	Merge 'alternator and cql: tests for default sstable compression' from Nadav Har'El The purpose of this two-patch series is to reproduce a previously unknown bug, Refs #26914. Recently we saw a lot of patches that change how we create new schemas (keyspaces and tables), sometimes changing various long-time defaults. We started to worry that perhaps some of these defaults were applied only to CQL base tables and perhaps not to Alternator or to CQL's auxiliary tables (materialized views, secondary indexes, or CDC logs). For example, in Refs #26307 we wondered if perhaps the default "speculative_retry" option is different in Alternator than in CQL. The first patch includes Alternator tests, and the second CQL tests. In both tests we discover that although recently (commit `adf9c42`, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, actually this change was only applied to CQL base tables. All Alternator tables and all CQL auxiliary tables (views, indexes, CDC) still use the old LZ4Compressor. This is issue #26914. Closes scylladb/scylladb#26915 * github.com:scylladb/scylladb: test/cqlpy: test compression setting for auxiliary table test/alternator: tests for schema of Alternator table	2025-11-20 10:24:31 +02:00
Karol Nowacki	104de44a8d	vector_search: Add support for secondary vector store clients This change adds support for secondary vector store clients, typically located in different availability zones. Secondary clients serve as fallback targets when all primary clients are unavailable. New configuration option allows specifying secondary client addresses and ports. Fixes: VECTOR-187 Closes scylladb/scylladb#26484	2025-11-20 08:37:18 +01:00
Pavel Emelyanov	1cabc8d9b0	Merge 'streaming: fix loop break condition in tablet_sstable_streamer::stream' from Ernest Zaslavsky When streaming SSTables by tablet range, the original implementation of tablet_sstable_streamer::stream may break out of the loop too early when encountering a non-overlapping SSTable. As a result, subsequent SSTables that should be classified as partially contained are skipped entirely. Tablet range: [4, 5] SSTable ranges: [0,5] [0, 3] <--- is considered exhausted, and causes skip to next tablet [2, 5] <--- is missed for range [4, 5] The loop uses if (!overlaps) break; semantics, which conflated “no overlap” with “done scanning.” This caused premature termination when an SSTable did not overlap but the following one did. Correct logic should be: before(sst_last) → skip and continue. after(sst_first) → break (no further SSTables can overlap). Otherwise → `contains` to classify as full or partial. Missing SSTables in streaming and potential data loss or incomplete streaming in repair/streaming operations. 1. Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved. 2. Refactor the loop to use before() and after() checks explicitly, and only break when the SSTable is entirely after the tablet range 3. Add pytest to cover this case, full streaming flow by means of `restore` 4. Add boost tests to test the new refactored function This data corruption fix should be ported back to 2024.2, 2025.1, 2025.2, 2025.3 and 2025.4 Fixes: https://github.com/scylladb/scylladb/issues/26979 Closes scylladb/scylladb#26980 * github.com:scylladb/scylladb: streaming: fix loop break condition in tablet_sstable_streamer::stream streaming: add pytest case to reproduce mutation loss issue	2025-11-20 10:16:17 +03:00
Piotr Dulikowski	dc7944ce5c	Merge 'vector_search: Fix error handling and status parsing' from Karol Nowacki vector_search: Fix error handling and status parsing This change addresses two issues in the vector search client that caused validator test failures: incorrect handling of 5xx server errors and faulty status response parsing. 1. 5xx Error Handling: Previously, a 5xx response (e.g., 503 Service Unavailable) from the underlying vector store for an `/ann` search request was incorrectly interpreted as a node failure. This would cause the node to be marked as down, even for transient issues like an index scan being in progress. This change ensures that 5xx errors are treated as transient search failures, not node failures, preventing nodes from being incorrectly marked as down. 2. Status Response Parsing: The logic for parsing status responses from the vector store was flawed. This has been corrected to ensure proper parsing. Fixes: SCYLLADB-50 Backport to 2025.4 as this problem is present on this branch. Closes scylladb/scylladb#27111 * github.com:scylladb/scylladb: vector_search: Don't mark nodes as down on 5xx server errors test: vector_search: Move unavailable_server to dedicated file vector_search: Fix status response parsing	2025-11-20 08:14:28 +01:00
Botond Dénes	6ee0f1f3a7	Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski This patch adds a metric for pre-compression size of sstable files. This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent. As for the implementation: Before the patch, tables and sstable sets are already tracking their total physical file size. Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats. To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size. New functionality, no backport needed. Closes scylladb/scylladb#26996 * github.com:scylladb/scylladb: replica/table: add a metric for hypothetical total file size without compression replica/table: keep track of total pre-compression file size	2025-11-20 09:10:38 +02:00
Karol Nowacki	9563d87f74	vector_search: Don't mark nodes as down on 5xx server errors For an `/ann` search request, a 5xx server response does not indicate that the node is down. It can signify a transient state, such as the index full scan being in progress. Previously, treating a 503 error as a node fault would cause the node to be incorrectly marked as down, for example, when a new index was being created. This commit ensures that such errors are treated as transient search failures, not node failures.	2025-11-20 08:10:20 +01:00
Karol Nowacki	366ecef1b9	test: vector_search: Move unavailable_server to dedicated file The unavailable_server code will be reused in upcoming client unit tests.	2025-11-20 08:09:21 +01:00
Benny Halevy	8ed36702ae	Update seastar submodule * seastar 63900e03...340e14a7 (19): > Merge 'rpc: harden sink_impl::close()' from Benny Halevy rpc: sink_impl::close: fixup indentation rpc: harden sink_impl::close() > http: Document the way "unread body bytes" accounting works > net: tighten port load balancing port access > coroutine: reimplement generator with buffered variant > Merge 'Stop using net::packet in posix data sink' from Pavel Emelyanov net/posix-stack: Don't use packet in posix_data_sink_impl reactor: Move fragment-vs-iovec static assertion reactor: Make backend::sendmsg() calls use std::span<iovec> utils: Introduce iovec_trim_front helper utils: Spannize iovec_len() > Merge 'Generalize memory data sink in tests' from Pavel Emelyanov test: Make output_stream_test splitting test case use own sink test: Make some output_stream_test cases use memory data sink test: Threadify one of output_stream_test test cases test: Make json_formatter_test use memory_data_sink test: Move memory_data_sink to its own header > dns: avoid using deprecated c-ares API > reactor: Move read_directory() to posix_file_impl > Merge 'rpc: sink_impl: batch sending and deletion of snd_buf:s' from Benny Halevy test: rpc_test: add test_rpc_stream_backpressure_across_shards reactor: add abort_on_too_long_task_queue option rpc: make sink flush and close noexcept rpc: sink_impl: batch sending and deletion of snd_buf:s rpc: move sink_impl and source_impl into internal namespace rpc: sink_impl: extend backpressure until snd_buf destroy > configure.py: fix --api-level help > Merge 'Close http client connection if handler doesn't consume all chunked-encoded body' from Pavel Emelyanov test: Fix indentation after previous patch test/http: Extend test for improper client handling of aborted requests test/http: Ignore EPIPE exception from server closing connection test/http: Split the partial response body read test http: Track "remaining bytes" for chunked_source_impl http: Switch content_length_source_impl to update remaining bytes > metrics: Add default ~config() > headers: Remove smp.hh from app-template.hh > prometheus: remove hostname and metric_help config > rpc: Tune up connection methods visibility > perf_tests: Fix build with fmt 12.0.0 by avoiding internal functions > doc: Fix some typos in codying style > reactor: Remove unused try_sleep() method directory_lister::get is adjusted in this patch to use the new experimental::coroutine::generator interface that was changed in scylladb/seastar@81f2dc9dd9 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26913	2025-11-20 07:29:47 +03:00
Pavel Emelyanov	53b71018e8	Merge 'Alternator: additional tests for ExclusiveStartKey' from Nadav Har'El In pull request #26960 there was some discussion about what is the valid form of ExclusiveStartKey, and whether we need to allow some "non-standard" uses of it for scan over system tables (which aren't real Alternator tables and may have multiple key columns, something not possible in normal Alternator tables). This made me realize our tests for what is allowed - and what is not allowed - in ExclusiveStartKey - are very sparse and don't cover all the cases that are possible in Scan and Query, in base tables and in GSIs. So this small series attempts to increase the coverage of the tests for ExclusiveStartKey to make sure we are compatible with DynamoDB and also that we don't regress in #26960. The new tests reproduce a previously unknown error-path issue, #26988, where in some cases DynamoDB considers ExclusiveStartKey to be invalid but Alternator erroneously accepts it. Fortunately, we didn't find any success-path (correctness) bugs. Closes scylladb/scylladb#26994 * github.com:scylladb/scylladb: test/alternator: tests for ExclusiveStartKey in GSI test/alternator: more tests for ExclusiveStartKey in Scan test/alternator: more tests for ExclusiveStartKey in Query	2025-11-20 07:21:39 +03:00
Avi Kivity	0d68512b1f	stall_free: make variadic dispose_gently sequential Having variadic dispose_gently() clear inputs concurrently serves no purpose, since this is a CPU bound operation. It will just add more tasks for the reactor to process. Reduce disruption to other work by processing inputs sequentially. Closes scylladb/scylladb#26993	2025-11-20 07:16:16 +03:00
Benny Halevy	fd81333181	test/pylib/cpp: increase max-networking-io-control-blocks value Increase the value of the max-networking-io-control-blocks option for the cpp tests as it is too low and causes flakiness as seen in vector_search.vector_store_client_test.vector_store_client_single_status_check_after_concurrent_failures: ``` seastar/src/core/reactor_backend.cc:342: void seastar::aio_general_context::queue(linux_abi::iocb *): Assertion `last < end` failed. ``` See also https://github.com/scylladb/seastar/issues/976 Fixes #27056 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27117	2025-11-20 04:31:36 +01:00
Ernest Zaslavsky	dedc8bdf71	streaming: fix loop break condition in tablet_sstable_streamer::stream Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved.	2025-11-19 17:32:49 +02:00
Tomasz Grabiec	f83c4ffc68	address_map: Use barrier() to wait for replication More efficient than 100 pings. There was one ping in test which was done "so this shard notices the clock advance". It's not necessary, since observing a completed SMP call implies that the local shard sees the clock advancement done within it.	2025-11-19 15:21:02 +01:00
Tomasz Grabiec	4a85ea8eb2	address_map: Use more efficient and reliable replication method Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and the system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated. This made bootstrap impossible, since nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable: if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Fixes #26835	2025-11-19 15:21:02 +01:00
Tomasz Grabiec	ed8d127457	utils: Introduce helper for replicated data structures Key goals: - efficient (batching updates) - reliable (no lost updates) Will be used in data structures maintained on one designated owning shard and replicated to other shards.	2025-11-19 15:21:02 +01:00
Michał Chojnowski	d8e299dbb2	sstables/trie/trie_writer: free nodes after they are flushed Somehow, the line of code responsible for freeing flushed nodes in `trie_writer` is missing from the implementation. This effectively means that `trie_writer` keeps the whole index in memory until the index writer is closed, which for many datasets is a guaranteed OOM. Fix that, and add a test that catches this. Fixes scylladb/scylladb#27082 Closes scylladb/scylladb#27083	2025-11-19 14:54:16 +02:00
Karol Nowacki	05b9cafb57	vector_search: Fix status response parsing The response was incorrectly parsed as a plain string and compared directly with C++ string. However, the body contains a JSON string, which includes escaped quotes that caused comparison failures.	2025-11-19 10:02:05 +01:00
Nadav Har'El	7b9428d8d7	test/cqlpy: test compression setting for auxiliary table In the previous patch we noticed that although recently (commit `adf9c42`, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, this change was only applied to CQL, not to Alternator. In this patch we add tests that demonstrate that it's even worse - the new compression only applies to CQL's base table - all the "auxiliary" tables - * Materialized views * Secondary index's materialized views * CDC log tables all still have the old LZ4Compressor, different from the base table's default compressor. The new test fails on Scylla, reproducing #26914, and passes on Cassandra (on Cassandra, we only compare the materialized view table, because SI and CDC is implemented differently). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-19 09:18:37 +02:00
Nadav Har'El	11f6a25d44	test/alternator: tests for schema of Alternator table This patch introduces a new test that exposed a previously unknown bug, Refs #26914: Recently we saw a lot of patches that change how we create new schemas (keyspaces and tables), sometimes changing various long-time defaults. We started to worry that perhaps some of these defaults were applied only to CQL and not to Alternator. For example, in Refs #26307 we wondered if perhaps the default "speculative_retry" option is different in Alternator than in CQL. This patch includes a new test file test/alternator/test_cql_schema.py, with tests for verifying how Alternator configures the underlying tables it creates. This test shows that the "speculative_retry" doesn't have this suspected bug - it defaults to "99.0PERCENTILE" in both CQL and Alternator. But unfortunately, we do have this bug with the "compression" option: It turns out that recently (commit `adf9c42`, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, but the change was only applied to CQL, not Alternator. So the test that "compression" is the same in both fails - it is marked "xfail" and I created a new issue to track it - #26914. Another test verifies that Alternator's "auxiliary" tables - holding GSIs, LSIs and Streams - have the same default properties as the base table. This currently seems to hold (there is no bug). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-19 09:18:37 +02:00
Pavel Emelyanov	4d5f7a57ea	sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging The feature in question is about the way streaming sink-and-source operate. Since sink-and-source itself are obtained from messaging service, the feature is better be fetched from it too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-19 09:35:54 +03:00
Pavel Emelyanov	64e099f03b	sstables_loader: Keep bool on send_meta, not database reference The send_meta helper needs database reference to get feature_service from it (to check some feature state). That's too much, features are "immutable" through the loader lifetime, it's enough to keep the boolean member on send_meta. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-19 09:32:56 +03:00
Patryk Jędrzejczak	e35ba974ce	test: test_raft_recovery_stuck: ensure mutual visibility before using driver Not waiting for nodes to see each other as alive can cause the driver to fail the request sent in `wait_for_upgrade_state()`. scylladb/scylladb#19771 has already replaced concurrent restarts with `ManagerClient.rolling_restart()`, but it has missed this single place, probably because we do concurrent starts here. Fixes #27055 Closes scylladb/scylladb#27075	2025-11-19 05:54:12 +01:00
David Garcia	3f2655a351	docs: add liveness::MustRestart support Closes scylladb/scylladb#27079	2025-11-18 15:28:55 +01:00
Szymon Wasik	f714876eaf	Add documentation about lack of returning similarity distances This patch adds the missing warning about the lack of possibility to return the similarity distance. This will be added in the next iteration. Fixes #27086 It has to be backported to 2025.4 as this is the limitation in 2025.4. Closes scylladb/scylladb#27096	2025-11-18 13:50:36 +01:00
Ernest Zaslavsky	656ce27e7f	streaming: add pytest case to reproduce mutation loss issue Introduce a test that demonstrates mutation loss caused by premature loop termination in tablet_sstable_streamer::stream. The code broke out of the SSTable iteration when encountering a non-overlapping range, which skipped subsequent SSTables that should have been partially contained. This test showcases the problem only. Example: Tablet range: [4, 5] SSTable ranges: [0,5] [0, 3] <--- is considered exhausted, and causes skip to next tablet [2, 5] <--- is missed for range [4, 5]	2025-11-18 09:34:41 +02:00
Avi Kivity	f7413a47e4	sstables: writer: avoid recursion in variadic write() Following `9b6ce030d0` ("sstables: remove quadratic (and possibly exponential) compile time in parse()"), where we removed recursion in reading, we do the same here for variadic write. This results in a small reduction in compile time. Note the problem isn't very bad here. This is tail-recursion, so likely removed by the compiler during optimization, and we don't have additional amplification due to future::then() double-compiling the ready-future and unready-future paths. Still, better to avoid quadratic compile times. Closes scylladb/scylladb#27050	2025-11-18 08:17:17 +02:00
Botond Dénes	2ca66133a4	Revert "db/config: don't use RBNO for scaling" This reverts commit `43738298be`. This commit causes instability in dtests. Several non-gating dtests started failing, as well as some gating ones, see #27047. Closes scylladb/scylladb#27067 Fixes #27047	2025-11-18 08:17:17 +02:00
Botond Dénes	0dbad38eed	Merge 'docs/dev/topology-over-raft: make various updates' from Patryk Jędrzejczak The updates include: - adding missing parts like topology states and table rows, - documenting zero-token nodes, - replacing the old recovery procedure with the new one. Fixes #26412 Updates of internal docs (usually read on master) don't require backporting. Closes scylladb/scylladb#27022 * github.com:scylladb/scylladb: docs/dev/topology-over-raft: update the recovery section docs/dev/topology-over-raft: document zero-token nodes docs/dev/topology-over-raft: clarify the lack of tablet-specific states docs/dev/topology-over-raft: add the missing join_group0 state docs/dev/topology-over-raft: update the topology columns	2025-11-18 08:17:17 +02:00
Patryk Jędrzejczak	adaa0560d9	Merge 'Automatic cleanup improvements' from Gleb Natapov This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously. Fixes https://github.com/scylladb/scylladb/issues/26866 Backport to all supported version since automatic cleanup behaviour as it is now may create unexpected by the operator load during cluster resizing. Closes scylladb/scylladb#26868 * https://github.com/scylladb/scylladb: cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster cleanup: Add RESTful API to allow reset cleanup needed flag	2025-11-18 08:17:17 +02:00
Pavel Emelyanov	02513ac2b8	alternator: Get feature service from proxy directly The executor::add_stream_options() obtains local database reference from proxy just to get feature service from it. Similar chain is used in executor::update_time_to_live(). It's shorter to get features from proxy itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26973	2025-11-18 08:17:16 +02:00
Botond Dénes	514c1fc719	Merge 'db: batchlog_manager: update _last_replay only if all batches were re…' from Aleksandra Martyniuk …played Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415. Needs backport to all live versions. Closes scylladb/scylladb#26793 * github.com:scylladb/scylladb: test: extend test_batchlog_replay_failure_during_repair db: batchlog_manager: update _last_replay only if all batches were replayed	2025-11-18 08:17:16 +02:00
Nadav Har'El	5b78e1cebe	test/alternator: tests for ExclusiveStartKey in GSI After in the previous patches we added more exhaustive testing for the ExclusiveStartKey feature of Query and Scan, in this patch we add tests for this feature in the context of GSIs. Most interestingly, the ExclusiveStartKey when querying a GSI isn't just the key of the GSI, but also includes the key columns of the base - in other words, it is the key that Scylla uses for its materialized view. The tests here confirm that paging on GSI works - this paging uses ExclusiveStartKey of course - but also what is the specific structure and meaning of the content of ExclusiveStartKey. We also include two xfailing tests which again, like in the previous patches, show we don't do enough validation (issue #26988) and don't recognize wrong values or spurious columns in ExclusiveStartKey. As usual, all new tests pass on DynamoDB, and all except the xfailing ones pass on Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-17 22:07:28 +02:00
Nadav Har'El	65b364d94a	test/alternator: more tests for ExclusiveStartKey in Scan In the previous patch we added more tests for ExclusiveStartKey in the context of the "Query" request. Here we do a similar thing for "Scan". There are fewer error cases for Scan. In particular, while it didn't make sense to use ExclusiveStartKey on a Query on a table without a sort key (since a Query there always returns a single item), for Scan it's needed - for paging. So we add in this patch a test (that we didn't have before!) that Scan paging works correctly also in the case of a table without a sort key. This patch has one xfailing test reproducing #26988, that we don't recognize and refuse spurious columns (columns not in the key) in ExclusiveStartKey. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-17 22:07:28 +02:00
Nadav Har'El	c049992a93	test/alternator: more tests for ExclusiveStartKey in Query We already have in test/alternator/test_query.py a test - test_query_exclusivestartkey - for one successful use of ExclusiveStartKey. But we missed testing quite a few edge cases of this parameter, so this patch adds more tests for it - see the comments on each individual test explaining its purpose. With the new tests, we actually identified three cases where we got the error handling wrong - cases of ExclusiveStartKey which DynamoDB refuses, but Alternator allows. So three of the tests included here pass on DynamoDB but fail on Alternator, so are marked with "xfail". Refs #26988 - which is a new issue about these three cases of missing validation. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-17 22:07:27 +02:00
Andrzej Jackowski	35fd603acd	test: wait for read_barrier in wait_until_driver_service_level_created Previously, `wait_until_driver_service_level_created` only waited for the `driver` service level to appear in the output of `LIST ALL SERVICE_LEVELS`. However, the fact that one node lists `sl:driver` does not necessarily mean that all other nodes can see it yet. This caused sporadic test failures, especially in DEBUG builds. To prevent these failures, this change adds an extra wait for a `raft/read_barrier` after the `driver` service level first appears. This ensures the service level is globally visible across the cluster. Fixes: scylladb/scylladb#27019	2025-11-17 15:21:28 +01:00
Andrzej Jackowski	39bfad48cc	test: use ManagerClient in wait_until_driver_service_level_created Pass a ManagerClient instead of a `cql` session to `wait_until_driver_service_level_created`. This makes it easier to add additional functionality to the helper later (e.g. waiting for a Raft read barrier in a subsequent commit). Refs: scylladb/scylladb#27019	2025-11-17 14:55:14 +01:00
Botond Dénes	d54d409a52	Merge 'audit: write out to both table and syslog' from Dario Mirovic This patch adds support for multiple audit log outputs. If only one audit log output is enabled, the behavior does not change. If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection of `storage_helper` objects. Performance testing shows that read query throughput and auth request throughput are consistent even at high reactor utilization. It can also be observed that read query latency increases a bit. Read query ops = 60k/s AUTH ops = 200/s \| Audit Mode \| QUERY latency (p99) \| Δ% vs none \| \|------------\|---------------------\|------------\| \| none \| 777 \| 0 \| \|table\| 801 \| +3.09% \| \|syslog \| 803 \| +3.35% \| \|table,syslog \| 818 \| +5.28% \| Read query ops = 50k/s AUTH ops = 200/s \| Audit Mode \| QUERY latency (p99) \| Δ% vs none \| \|------------\|---------------------\|------------\| \| none \| 643 \| 0 \| \|table\| 647 \| +0.62% \| \|syslog \| 648 \| +0.78% \| \|table,syslog \| 656 \| +2.02% \| Detailed performance results are in the following Confluence document: [Audit performance impact test](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test) Fixes #26022 Backport: The decision is to not backport for now. After making sure it works on the latest release, and if there is a need, we can do it. Closes scylladb/scylladb#26613 * github.com:scylladb/scylladb: test: dtest: audit_test.py: add AuditBackendComposite test: dtest: audit_test.py: group logs in dict per audit mode audit: write out to both table and syslog audit: move storage helper creation from `audit::start` to `audit::audit` audit: fix formatting in `audit::start_audit` audit: unify `create_audit` and `start_audit`	2025-11-17 15:04:15 +02:00
Gleb Natapov	0f0ab11311	cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster `97ab3f6622` changed "nodetool cleanup" (without arguments) to run cleanup on all dirty nodes in the cluster. This was somewhat unexpected, so this patch changes it back to run cleanup on the target node only (and reset "cleanup needed" flag afterwards) and it adds "nodetool cluster cleanup" command that runs the cleanup on all dirty nodes in the cluster.	2025-11-17 15:00:51 +02:00
Piotr Dulikowski	c29efa2cdb	Merge 'vector_search: Improve vector-store health checking' from Karol Nowacki A Vector Store node is now considered down if it returns an HTTP 500 server error. This can happen, for example, if the node fails to connect to the database or has not completed its initial full scan. The logic for marking a node as 'up' is also enhanced. A node is now only considered up when its status is explicitly 'SERVING'. Fixes: VECTOR-187 Backport to 2025.4 as this feature is expected to be available in 2025.4. Closes scylladb/scylladb#26413 * github.com:scylladb/scylladb: vector_search: Improve vector-store health checking vector_search: Move response_content_to_sstring to utils.hh vector_search: Add unit tests for client error handling vector_search: Enable mocking of status requests vector_search: Extract abort_source_timeout and repeat_until vector_search: Move vs_mock_server to dedicated files	2025-11-17 12:16:07 +01:00
Dawid Mędrek	0602afc085	cdc: Preserve properties when reattaching log table When we enable CDC on a table, Scylla creates a log table for it. It has default properties, but the user may change them later on. Furthermore, it's possible to detach that log table by simply disabling CDC on the base table: ```cql /* Create a table with CDC enabled. The log table is created. */ CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true}; /* Detach the log table. */ ALTER TABLE ks.t WITH cdc = {'enabled': false}; /* Modify a property of the log table. */ ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13; ``` The log table can also be reattached by enabling CDC on the base table again: ```cql /* Reattach the log table */ ALTER TABLE ks.t WITH cdc = {'enabled': true}; ``` However, because the process of reattachment goes through the same code that created it in the first place, the properties of the log table are rolled back to their default values. This may be confusing to the user and, if unnoticed, also have other consequences, e.g. affecting performance. To prevent that, we ensure that the properties are preserved. A reproducer test, `test_log_table_preserves_properties_after_reattachment`, has been provided to verify that the changes are correct. It fails before this commit. Another test, `test_log_table_preserves_id_after_reattachment`, has also been added because the current implementation sets properties and the ID separately. Fixes scylladb/scylladb#25523	2025-11-17 11:56:30 +01:00
Dawid Mędrek	10975bf65c	cdc: Extract creating columns in CDC log table to dedicated function We extract the portion of the code responsible for creating columns in a CDC log table to a separate, dedicated function. This should improve the overall readability of the function (and also makes it very short now).	2025-11-17 11:54:48 +01:00
Dawid Mędrek	8bf09ac6f7	cdc: Extract default properties of CDC log tables to dedicated function We extract the portion of the code responsible for setting the default properties of a CDC log table to a separate function. This should improve the overall readability of the function. Also, it should be helpful when modifying the code later on in this commit series.	2025-11-17 11:50:35 +01:00
Dawid Mędrek	991c0f6e6d	schema/schema_builder.hh: Add set_properties We add a method used for overwriting the properties of a schema. It will be used to create a new schema based on another.	2025-11-17 11:46:32 +01:00
Dawid Mędrek	76b21d7a5a	schema: Add getter for schema::user_properties The getter will be used later to access the user properties and copy them to a fresh `schema_builder`.	2025-11-17 11:46:24 +01:00
Dawid Mędrek	3856c9d376	schema: Remove underscores in fields of schema::user_properties The fields are public, so according to the style guide, they should not start with an underscore.	2025-11-17 11:46:15 +01:00
Dawid Mędrek	5a0fddc9ee	schema: Extract user properties out of raw_schema The properties can be directly manipulated by the user via statements like `ALTER TABLE`. To better organize the structure of `raw_schema`, we encapsulate that data in the form of a dedicated struct. This change will be later used for applying multiple properties to `schema_builder` in one go.	2025-11-17 11:46:07 +01:00
Patryk Jędrzejczak	b5f38e4590	docs/dev/topology-over-raft: update the recovery section We have the new recovery procedure now, but this doc hasn't been updated. It still describes the old recovery procedure. For comparison, external docs can be found here: https://docs.scylladb.com/manual/master/troubleshooting/handling-node-failures.html#manual-recovery-procedure Fixes #26412	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	785a3302e6	docs/dev/topology-over-raft: document zero-token nodes The topology transitions are a bit different for zero-token nodes, which is worth mentioning.	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	d75558e455	docs/dev/topology-over-raft: clarify the lack of tablet-specific states Tablets are never mentioned before this part of the doc, so it may be confusing why some topology states are missing.	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	c362ea4dcb	docs/dev/topology-over-raft: add the missing join_group0 state This state was added as a part of the join procedure, and we didn't update this part of the doc.	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	182d416949	docs/dev/topology-over-raft: update the topology columns Some of the columns were added, but the doc wasn't updated. `upgrade_state` was updated in only one of the two places. `ignore_nodes` was changed to a static column.	2025-11-17 10:40:20 +01:00
Piotr Dulikowski	f0039381d2	Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak This PR fixes staging sstables handling by view building coordinator in case of intra-node tablet migration or tablet merge. To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of `(table_id, last_token)` pair. There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine. For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard. The patch should be backported to 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26244 Closes scylladb/scylladb#26454 * github.com:scylladb/scylladb: service/storage_service: migrate staging sstables in view building worker during intra-node migration db/view/view_building_worker: support sstables intra-node migration db/view_building_worker: fix indent db/view/view_building_worker: don't organize staging sstables by last token	2025-11-17 08:53:19 +01:00
Karol Nowacki	7f45f15237	vector_search: Improve vector-store health checking A Vector Store node is now considered down if it returns an HTTP 5xx status. This can happen, for example, if the node fails to connect to the database or has not completed its initial full scan. The logic for marking a node as 'up' is also enhanced. A node is now only considered up when its status is 'SERVING'.	2025-11-17 06:21:31 +01:00
Karol Nowacki	5c30994bc5	vector_search: Move response_content_to_sstring to utils.hh Move the response_content_to_sstring utility function from vector_store_client.cc to utils.hh to enable reuse across multiple files. This refactoring prepares for the upcoming `client.cc` implementation that will also need this functionality.	2025-11-17 06:21:31 +01:00
Karol Nowacki	4bbba099d7	vector_search: Add unit tests for client error handling Introduce dedicated unit tests for the client class to verify existing functionality and serve as regression tests. These tests ensure that invalid client requests do not cause nodes to be marked as down.	2025-11-17 06:21:31 +01:00
Karol Nowacki	cb654d2286	vector_search: Enable mocking of status requests Extend the mock server to allow inspecting incoming status requests and configuring their responses. This enables client unit tests to simulate various server behaviors, such as handling node failures and backoff logic.	2025-11-17 06:21:31 +01:00
Karol Nowacki	f665564537	vector_search: Extract abort_source_timeout and repeat_until The `abort_source_timeout` and `repeat_until` functions are moved to the shared utility header `test/vector_search/utils.hh`. This allows them to be reused by upcoming `client` unit tests, avoiding code duplication.	2025-11-17 06:21:31 +01:00
Karol Nowacki	ee3b83c9b0	vector_search: Move vs_mock_server to dedicated files The mock server utility is extracted into its own files so it can be reused by future `client` unit tests.	2025-11-17 06:21:30 +01:00
Artsiom Mishuta	696596a9ef	test.py: shutdown ManagerClient only in current loop In python 3.14 there is a stricter policy regarding asyncio loops. As a result, we cannot close clients from different loops. This change ensures that we close only the client in the current loop. Closes scylladb/scylladb#26911	2025-11-16 19:19:46 +02:00
Jenkins Promoter	3672715211	Update pgo profiles - x86_64	2025-11-16 11:42:41 +02:00
Jenkins Promoter	41933b3f5d	Update pgo profiles - aarch64	2025-11-15 05:27:38 +02:00
Pavel Emelyanov	9cb776dee8	sstables_manager: Drop db::config from sstables_manager Now it has all it needs via its own specific config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	d55044b696	tools/sstable: Make shard_of_with_tablets use db::config argument Its caller, the shard_of_operation, already has it as argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	2ec3303edd	tools/sstable: Add db::config& to all operations It's not extremely elegant, but one tool operation needs db::config -- the "shard of" one. Currently it gets one from sstables_manager, but the manager is going to stop using db::config, so the operation needs to get it some other way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	0fede18447	tools/sstable: Get endpoints from storage manager The tool may open sstables on S3. For that it gets configured endpoints with the help of db::config obtained from sstables_manager.db_config(). However, storage endpoints are maintained by sstables storage manager, and since tool has this instance, it's better to use storage manager to get list of endpoints. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	675eb3be98	sstables_manager: Hold sstable IO extensions on it Currently manager holds a reference on db::config and when sstables IO extensions are needed it grabs them from this config. Since db::config is going to be removed from sstables manager, it should either keep track of all config extensions, or only those that it needs. This patch makes the latter choice and keeps reference to sstable_file_io_ext. on manager. The reference is passed as constructor argument, not via manager config, but it's a random choice, no specific reason why not putting it on config itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	c853197281	sstables: Manager helper to grab file io extensions Currently all the code that needs to iterate over sstables extensions gets the config from the manager, takes the extensions from it, and then iterates. Add a helper that returns extensions directly. No real changes, just a helper. Next patch will change the way the helper works. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	9868341c73	sstables_manager: Move default format on config It's explicitly `me` type by default, but places that can write sstables override it with db::config value: replica::database, tests and scylla sstable tool. Live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	e6dee8aab5	sstables_manager: Move enable_sstable_data_integrity_check on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	78ab31118e	sstables_manager: Move data_file_directories on config Make it a reference, so all the code that configures it is updated to provide the target. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	cb1679d299	sstables_manager: Move components_memory_reclaim_threshold on config Set its default value to the one from db/config.cc. Only the replica::database and tests may want to re-configure it. This one is live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:42 +03:00
Botond Dénes	8579e20bd1	Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk This PR enables integrity check of both checksum and digest for repair/streaming. In the past, streaming readers only verified the checksum of compressed SSTables. This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped. To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression. Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`. Backport is not required, it is an improvement Refs #21776 Closes scylladb/scylladb#26444 * github.com:scylladb/scylladb: boost/repair_test: add repair reader integrity verification test cases test/lib: allow to disable compression in random_mutation_generator sstables: Skip checksum and digest reads for unlinked SSTables table: enable integrity checks for streaming reader table: Add integrity option to table::make_sstable_reader() sstables: Add integrity option to create_single_key_sstable_reader	2025-11-14 18:00:33 +02:00
Benny Halevy	f9ce98384a	scylla-sstable: correctly dump sharding_metadata This patch fixes 2 issues at one go: First, Currently sstables::load clears the sharding metadata (via open_data()), and so scylla-sstable always prints an empty array for it. Second, printing token values would generate invalid json as they are currently printed as binary bytes, and they should be printed simply as numbers, as we do elsewhere, for example, for the first and last keys. Fixes #26982 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26991	2025-11-14 17:55:41 +02:00
Aleksandra Martyniuk	e3dcb7e827	test: extend test_batchlog_replay_failure_during_repair Modify test_batchlog_replay_failure_during_repair to also check that there isn't data resurrection if flushing hints falls within the repair cache timeout.	2025-11-14 14:18:07 +01:00
Pavel Emelyanov	1c9c4c8c8c	Merge 'service: attach storage_service to migration_manager using pluggable' from Marcin Maliszkiewicz Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734 Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made it hitting more often, as _ss was called earlier, but it's not released yet. Closes scylladb/scylladb#26779 * github.com:scylladb/scylladb: service: attach storage_service to migration_manager using pluggabe service: migration_manager: corutinize merge_schema_from service: migration_manager: corutinize reload_schema	2025-11-14 15:14:28 +03:00
Piotr Dulikowski	2ccc94c496	Merge 'topology_coordinator: include joining node in barrier' from Michael Litvak Previously, only nodes in the 'normal' state and decommissioning nodes were included in the set of nodes participating in barrier and barrier_and_drain commands. Joining nodes are not included because they don't coordinate requests, given their cql port is closed. However, joining nodes may receive mutations from other nodes, for which they may generate and coordinate materialized view updates. If their group0 state is not synchronized it could cause lost view updates. For example: 1. On the topology coordinator, the join completes and the joining node becomes normal, but the joining node's state lags behind. Since it's not synchronized by the barrier, it could be in an old state such as `write_both_read_old`. 2. A normal node coordinates a write and sends it to the new node as the new replica. 3. The new node applies the base mutation but doesn't generate a view update for it, because it calculates the base-view pairing according to its own state and replication map, and determines that it doesn't participate in the base-view pairing. Therefore, since the joining node participates as a coordinator for view updates, it should be included in these barriers as well. This ensures that before the join completes, the joining node's state is `write_both_read_new`, where it does generate view updates. Fixes https://github.com/scylladb/scylladb/issues/26976 backport to previous versions since it fixes a bug in MV with vnodes Closes scylladb/scylladb#27008 * github.com:scylladb/scylladb: test: add mv write during node join test topology_coordinator: include joining node in barrier	2025-11-14 12:41:16 +01:00
Pavel Emelyanov	604e5b6727	sstables_manager: Move column_index_auto_scale_threshold on config Set its default value to the one from db/config.cc. Only the replica::database may want to re-configure it. This one is live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:30:49 +03:00
Pavel Emelyanov	8f9f92728e	sstables_manager: Move column_index_size on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:30:28 +03:00
Pavel Emelyanov	88bb203c9c	sstables_manager: Move sstable_summary_ratio on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:29:34 +03:00
Pavel Emelyanov	1f6918be3f	sstables_manager: Move enable_sstable_key_validation on config Make it OFF by default and update only those callers, that may have it ON -- the replica::database, tests and scylla-sstable tool. Also not live-updateable, so plain bool. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:28:14 +03:00
Pavel Emelyanov	79d0f93693	sstables_manager: Move available_memory on config Currently, this parameter is passed to sstables_manager as explicit constructor argument. Also, it's not live-updateable, so a plain size_t type for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:27:14 +03:00
Pavel Emelyanov	218916e7c2	code: Introduce sstables_manager::config This is specific configuration for sstables_manager. All places that construct sstables manager are updated to provide config to it. For now the config is empty and exists alongside with db::config. Further patches will populate the former config with data and the latter config will be eventually removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:25:18 +03:00
Pavel Emelyanov	004ba32fa5	sstables: Patch get_local_directories() to work on vector of paths Now it uses db::config. Next patches will eliminate db::config from this code and the helper in question will need to get datadir names explicitly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:24:04 +03:00
Pavel Emelyanov	1895d85ed2	code: Rename sstables_manager::config() into db_config() The config() method name is going to return sstables_manager config, so first need to set this name free. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:23:08 +03:00
Patryk Jędrzejczak	1141342c4f	Merge 'topology: refactor excluded nodes' from Petr Gusev This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`. The PR improves code readability and efficiency, no behavior changes. backport: this is a refactoring/optimization, no need to backport Closes scylladb/scylladb#26907 * https://github.com/scylladb/scylladb: topology_coordinator: drop unused exec_global_command overload topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request topology_state_machine: inline get_excluded_nodes messaging_service: simplify and optimize ban_host storage_service: topology_state_load: extract topology variable topology_coordinator: excluded_tablet_nodes -> ignored_nodes topology_state_machine: add excluded_tablet_nodes field	2025-11-14 11:52:00 +01:00
Piotr Dulikowski	68407a09ed	Merge 'vector_store_client: Add support for failed-node backoff' from Karol Nowacki vector_search: Add backoff for failed nodes Introduces logic to mark nodes that fail to answer an ANN request as "down". Down nodes are omitted from further requests until they successfully respond to a health check. Health checks for down nodes are performed in the background using the `status` endpoint, with an exponential backoff retry policy ranging from 100ms to 20s. Client list management is moved to separate files (clients.cc/clients.hh) to improve code organization and modularity. References: VECTOR-187. Backport to 2025.4 as this feature is expected to be available in 2025.4. Closes scylladb/scylladb#26308 * github.com:scylladb/scylladb: vector_search: Set max backoff delay to 2x read request timeout vector_search: Report status check exception via on_internal_error_noexcept vector_search: Extract client management into dedicated class vector_search: Add backoff for failed clients vector_search: Make endpoint available vector_search: Use std::expected for low-level client errors vector_search: Extract client class	2025-11-14 11:49:18 +01:00
Piotr Dulikowski	833b824905	Merge 'service/qos: Fall back to default scheduling group when using maintenance socket' from Dawid Mędrek The service level controller relies on `auth::service` to collect information about roles and the relation between them and the service levels (those attached to them). Unfortunately, the service level controller is initialized way earlier than `auth::service` and so we had to prevent potential invalid queries of user service levels (cf. `46193f5e79`). Unfortunately, that came at a price: it made the maintenance socket incompatible with the current implementation of the service level controller. The maintenance socket starts early, before the `auth::service` is fully initialized and registered, and is exposed almost immediately. If the user attempts to connect to Scylla within this time window, via the maintenance socket, one of the things that will happen is choosing the right service level for the connection. Since the `auth::service` is not registered, Scylla will fail an assertion and crash. A similar scenario occurs when using maintenance mode. The maintenance socket is how the user communicates with the database, and we're not prepared for that either. To avoid unnecessary crashes, we add new branches if the passed user is absent or if it corresponds to the anonymous role. Since the role corresponding to a connection via the maintenance socket is the anonymous role, that solves the problem. Some accesses to `auth::service` are not affected and we do not modify those. Fixes scylladb/scylladb#26816 Backport: yes. This is a fix of a regression. Closes scylladb/scylladb#26856 * github.com:scylladb/scylladb: test/cluster/test_maintenance_mode.py: Wait for initialization test: Disable maintenance mode correctly in test_maintenance_mode.py test: Fix keyspace in test_maintenance_mode.py service/qos: Do not crash Scylla if auth_integration absent	2025-11-14 11:12:28 +01:00
Botond Dénes	43738298be	db/config: don't use RBNO for scaling Remove bootstrap and decommission from allowed_repair_based_node_ops. Using RBNO over streaming for these operations has no benefits, as they are not exposed to the out-of-date replica problem that replace, removenode and rebuild are. On top of that, RBNO is known to have problems with empty user tables. Using streaming for bootstrap and decommission is safe and faster than RBNO in all conditions, especially when the table is small. One test needs adjustment as it relies on RBNO being used for all node ops. Fixes: #24664 Closes scylladb/scylladb#26330	2025-11-14 13:03:50 +03:00
Piotr Dulikowski	43506e5f28	Merge 'db/view: Add backoff when RPC fails' from Dawid Mędrek The view building coordinator manages the process by sending RPC requests to all nodes in the cluster, instructing them what to do. If processing that message fails, the coordinator decides if it wants to retry it or (temporarily) abandon the work. An example of the latter scenario could be if one of the target nodes dies and any attempts to communicate with it would fail. Unfortunately, the current approach to it is not perfect and may result in a storm of warnings, effectively clogging the logs. As an example, take a look at scylladb/scylladb#26686: the gossiper failed to mark one of the dead nodes as DOWN fast enough, and it resulted in a warning storm. To prevent situations like that, we implement a form of backoff. If processing an RPC message fails, we postpone finishing the task for a second. That should reduce the number of messages in the logs and avoid retries that are likely to fail as well. We provide a reproducer test. Fixes scylladb/scylladb#26686 Backport: impact on the user. We should backport it to 2025.4. Closes scylladb/scylladb#26729 * github.com:scylladb/scylladb: tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc db/view/view_building_coordinator: Rate limit logging failed RPC db/view: Add backoff when RPC fails	2025-11-14 10:17:57 +01:00
Piotr Dulikowski	308c5d0563	Merge 'cdc: set column drop timestamp in the future' from Michael Litvak When dropping a column from a CDC log table, set the column drop timestamp several seconds into the future. If a value is written to a column concurrently with dropping that column, the value's timestamp may be after the column drop timestamp. If this value is also flushed to an SSTable, the SSTable would be corrupted, because it considers the column missing after the drop timestamp and doesn't allow values for it. While this issue affects general tables, it especially impacts CDC tables because this scenario can occur when writing to a table with CDC preimage enabled while dropping a column from the base table. This happens even if the base mutation doesn't write to the dropped column, because CDC log mutations can generate values for a column even if the base mutation doesn't. For general tables, this issue can be avoided by simply not writing to a column while dropping it. We fix this for the more problematic case of CDC log tables by setting the column drop timestamp several seconds into the future, ensuring that writes concurrent with column drops are much less likely to have timestamps greater than the column drop timestamp. Fixes https://github.com/scylladb/scylladb/issues/26340 the issue affects all previous releases, backport to improve stability Closes scylladb/scylladb#26533 * github.com:scylladb/scylladb: test: test concurrent writes with column drop with cdc preimage cdc: check if recreating a column too soon cdc: set column drop timestamp in the future migration_manager: pass timestamp to pre_create	2025-11-14 08:52:34 +01:00
Marcin Maliszkiewicz	958d04c349	service: attach storage_service to migration_manager using pluggabe Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	cf9b2de18b	service: migration_manager: corutinize merge_schema_from It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	5241e9476f	service: migration_manager: corutinize reload_schema It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:18 +01:00
Tomasz Grabiec	27e74fa567	tools: scylla-sstable: Print filename and tablet ids on error Since the error is not printed to stdout, when working with multiple files we don't know which sstable the error is associated with. Closes scylladb/scylladb#27009	2025-11-14 09:47:38 +02:00
Karol Nowacki	1972fb315b	vector_search: Set max backoff delay to 2x read request timeout The maximum backoff delay for status checking now depends on the `read_request_timeout_in_ms` configuration option. The delay is set to twice the value of this parameter.	2025-11-14 08:05:21 +01:00
Karol Nowacki	097c0f9592	vector_search: Report status check exception via on_internal_error_noexcept This exception should only occur due to internal errors, not client or external issues. If triggered, it indicates an internal problem. Therefore, we notify about this exception using on_internal_error_noexcept.	2025-11-14 08:05:21 +01:00
Karol Nowacki	940ed239b2	vector_search: Extract client management into dedicated class Refactor client list management by moving it to separate files (clients.cc/clients.hh) to improve code organization and modularity.	2025-11-14 08:05:21 +01:00
Karol Nowacki	009d3ea278	vector_search: Add backoff for failed clients Introduces logic to mark clients that fail to answer an ANN request as "down". Down clients are omitted from further requests until they successfully respond to a health check. Health checks for down clients are performed in the background using the `status` endpoint, with an exponential backoff retry policy ranging from 100ms to 20s.	2025-11-14 07:38:01 +01:00
Karol Nowacki	190459aefa	vector_search: Make endpoint available In preparation for a new feature, the tests need the ability to make an endpoint that was previously unavailable, available again. This is achieved by adding an `unavailable_server::take_socket` method. This method allows transferring the listening socket from the `unavailable_server` to the `mock_vs_server`, ensuring they both operate on the same endpoint.	2025-11-14 07:23:40 +01:00
Karol Nowacki	49a177b51e	vector_search: Use std::expected for low-level client errors To unify error handling, the low-level client methods now return `std::expected` instead of throwing exceptions. This allows for consistent and explicit error propagation from the client up to the caller. The relevant error types have been moved to a new `vector_search/error.hh` header to centralize their definitions.	2025-11-14 07:23:40 +01:00
Karol Nowacki	62f8b26bd7	vector_search: Extract client class This refactoring extracts low-level client logic into a new, dedicated `client` class. The new class is responsible for connecting to the server and serializing requests. This change prepares for extending the `vector_store_client` to check node status via the `api/v1/status` endpoint. `/ann` Response deserialization remains in the `vector_store_client` as it is schema-dependent.	2025-11-14 07:23:40 +01:00
Lakshmi Narayanan Sreethar	3eba90041f	sstables: prevent oversized allocation when parsing summary positions During sstable summary parsing, the entire header was read into a single buffer upfront and then parsed to obtain the positions. If the header was too large, it could trigger oversized allocation warnings. This commit updates the parse method to read one position at a time from the input stream instead of reading the entire header at once. Since `random_access_reader` already maintains an internal buffer of 128 KB, there is no need to pre read the entire header upfront. Fixes #24428 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26846	2025-11-14 06:40:53 +02:00
Dawid Mędrek	393f1ca6e6	tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc After the changes in the test, we clean up its syntax. It boils down to very simple modifications.	2025-11-13 17:57:33 +01:00
Dawid Mędrek	acd9120181	db/view/view_building_coordinator: Rate limit logging failed RPC The view building coordinator sends tasks in form of RPC messages to other nodes in the cluster. If processing that RPC fails, the coordinator logs the error. However, since tasks are per replica (so per shard), it may happen that we end up with a large number of similar messages, e.g. if the target node has died, because every shard will fail to process its RPC message. It might become even worse in the case of a network partition. To mitigate that, we rate limit the logging by 1 second. We extend the test `test_backoff_when_node_fails_task_rpc` so that it allows the view building coordinator to have multiple tablet replica targets. If not for rate limiting the warning messages, we should start getting more of them, potentially leading to a test failure.	2025-11-13 17:57:23 +01:00
Dawid Mędrek	4a5b1ab40a	db/view: Add backoff when RPC fails The view building coordinator manages the process of view building by sending RPC requests to all nodes in the cluster, instructing them what to do. If processing that message fails, the coordinator decides if it wants to retry it or (temporarily) abandon the work. An example of the latter scenario could be if one of the target nodes dies and any attempts to communicate with it would fail. Unfortunately, the current approach to it is not perfect and may result in a storm of warnings, effectively clogging the logs. As an example, take a look at scylladb/scylladb#26686: the gossiper failed to mark one of the dead nodes as DOWN fast enough, and it resulted in a warning storm. To prevent situations like that, we implement a form of backoff. If processing an RPC message fails, we postpone finishing the task for a second. That should reduce the number of messages in the logs and avoid retries that are likely to fail as well. We provide a reproducer test: it fails before this commit and succeeds with it. Fixes scylladb/scylladb#26686	2025-11-13 17:55:41 +01:00
Michał Hudobski	7646dde25b	select_statement: add a warning about unsupported paging for vs queries Currently we do not support paging for vector search queries. When we get such a query with paging enabled we ignore the paging and return the entire result. This behavior can be confusing for users, as there is no warning about paging not working with vector search. This patch fixes that by adding a warning to the result of ANN queries with paging enabled. Closes scylladb/scylladb#26384	2025-11-13 18:47:05 +02:00
Michael Litvak	e85051068d	test: test concurrent writes with column drop with cdc preimage add a test that writes to a table concurrently with dropping a column, where the table has CDC enabled with preimage. the test reproduces issue #26340 where this results in a malformed sstable.	2025-11-13 17:00:08 +01:00
Michael Litvak	039323d889	cdc: check if recreating a column too soon When we drop a column from a CDC log table, we set the column drop timestamp a few seconds into the future. This can cause unexpected problems if a user tries to recreate a CDC column too soon, before the drop timestamp has passed. To prevent this issue, when creating a CDC column we check its creation timestamp against the existing drop timestamp, if any, and fail with an informative error if the recreation attempt is too soon.	2025-11-13 17:00:07 +01:00
Michael Litvak	48298e38ab	cdc: set column drop timestamp in the future When dropping a column from a CDC log table, set the column drop timestamp several seconds into the future. If a value is written to a column concurrently with dropping that column, the value's timestamp may be after the column drop timestamp. If this value is also flushed to an SSTable, the SSTable would be corrupted, because it considers the column missing after the drop timestamp and doesn't allow values for it. While this issue affects general tables, it especially impacts CDC tables because this scenario can occur when writing to a table with CDC preimage enabled while dropping a column from the base table. This happens even if the base mutation doesn't write to the dropped column, because CDC log mutations can generate values for a column even if the base mutation doesn't. For general tables, this issue can be avoided by simply not writing to a column while dropping it. We fix this for the more problematic case of CDC log tables by setting the column drop timestamp several seconds into the future, ensuring that writes concurrent with column drops are much less likely to have timestamps greater than the column drop timestamp. Fixes scylladb/scylladb#26340	2025-11-13 16:59:43 +01:00
Michael Litvak	eefae4cc4e	migration_manager: pass timestamp to pre_create pass the write timestamp as parameter to the on_pre_create_column_families notification.	2025-11-13 16:59:43 +01:00
Piotr Dulikowski	7f482c39eb	Merge '[schema] Speculative retry rounding fix' from Dario Mirovic This patch series re-enables support for speculative retry values `0` and `100`. These values were supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879 ](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, valid `100PERCENTILE` and `0PERCENTILE` values were prevented too. Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported. Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit. Fixes #26369 This is a bug fix. If at any time a client tries to use value >= 99.5 and < 100, the raft error will happen. Backport is needed. The code which introduced inconsistency is introduced in 2025.2, so no backporting to 2025.1. Closes scylladb/scylladb#26909 * github.com:scylladb/scylladb: test: cqlpy: add test case for non-numeric PERCENTILE value schema: speculative_retry: update exception type for sstring ops docs: cql: ddl.rst: update speculative-retry-options test: cqlpy: add test for valid speculative_retry values schema: speculative_retry: allow 0 and 100 PERCENTILE values	2025-11-13 15:27:45 +01:00
Petr Gusev	d3bd8c924d	topology_coordinator: drop unused exec_global_command overload	2025-11-13 14:19:03 +01:00
Petr Gusev	45d1302066	topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request This method is specific to topology requests -- node joining, replacing, decommissioning etc, everything that goes through topology::transition_state::write_both_read_old and raft_topology_cmd::command::stream_ranges. It shouldn't be used in other contexts -- to handle global topology requests (e.g. truncate table) or for tablets. Rename the method to make this more explicit.	2025-11-13 14:19:03 +01:00
Petr Gusev	bf8cc5358b	topology_state_machine: inline get_excluded_nodes The method is specific to topology_coordinator, which already contains a wrapper for it, so inline the topology method into it. Also, make the logic of the method more explicit and remove multiple transition_nodes lookups.	2025-11-13 14:18:46 +01:00
Taras Veretilnyk	e7ceb13c3b	boost/repair_test: add repair reader integrity verification test cases Adds test cases to verify that repair_reader correctly detects SSTable (both compressed and uncompressed) checksum mismatch. Digest mismatch verification is not possible as the repair reader may skip some sstable data, which automatically disables digest verification. Each test corrupts the Data component on disk and ensures the reader throws a malformed_sstable_exception with the expected error message.	2025-11-13 14:08:33 +01:00
Taras Veretilnyk	554ce17769	test/lib: allow to disable compression in random_mutation_generator Adds a compress flag to random_mutation_generator, allowing tests to disable compression in generated mutations. When set to compress::no, the schema builder uses no_compression() parameters.	2025-11-13 14:08:33 +01:00
Taras Veretilnyk	add60d7576	sstables: Skip checksum and digest reads for unlinked SSTables Add an _unlinked flag to track SSTable unlink state and check it in read_digest() and read_checksum() methods to skip file reads for unlinked SSTables, preventing potential file not found errors.	2025-11-13 14:08:26 +01:00
Michael Litvak	b925e047be	test: add mv write during node join test Add a test that reproduces the issue scylladb/scylladb#26976. The test adds a new node with delayed group0 apply, and does writes with MV updates right after the join completes on the coordinator and while the joining node's state is behind. The test fails before fixing the issue and passes after.	2025-11-13 12:24:32 +01:00
Michael Litvak	13d94576e5	topology_coordinator: include joining node in barrier Previously, only nodes in the 'normal' state and decommissioning nodes were included in the set of nodes participating in barrier and barrier_and_drain commands. Joining nodes are not included because they don't coordinate requests, given their cql port is closed. However, joining nodes may receive mutations from other nodes, for which they may generate and coordinate materialized view updates. If their group0 state is not synchronized it could cause lost view updates. For example: 1. On the topology coordinator, the join completes and the joining node becomes normal, but the joining node's state lags behind. Since it's not synchronized by the barrier, it could be in an old state such as `write_both_read_old`. 2. A normal node coordinates a write and sends it to the new node as the new replica. 3. The new node applies the base mutation but doesn't generate a view update for it, because it calculates the base-view pairing according to its own state and replication map, and determines that it doesn't participate in the base-view pairing. Therefore, since the joining node participates as a coordinator for view updates, it should be included in these barriers as well. This ensures that before the join completes, the joining node's state is `write_both_read_new`, where it does generate view updates. Fixes scylladb/scylladb#26976	2025-11-13 12:24:31 +01:00
Michał Chojnowski	346e0f64e2	replica/table: add a metric for hypothetical total file size without compression This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent.	2025-11-13 11:28:19 +01:00
Dawid Mędrek	b357c8278f	test/cluster/test_maintenance_mode.py: Wait for initialization If we try to perform queries too early, before the call to `storage_service::start_maintenance_mode` has finished, we will fail with the following error: ``` ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index! ``` To avoid that, we should wait until initialization is complete.	2025-11-13 11:07:45 +01:00
Aleksandra Martyniuk	4d0de1126f	db: batchlog_manager: update _last_replay only if all batches were replayed Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415.	2025-11-13 10:40:19 +01:00
Piotr Dulikowski	2e5eb92f21	Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema. The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error. We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table. When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well. The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit. 
Fixes https://github.com/scylladb/scylladb/issues/26405 backport not needed - enhancement Closes scylladb/scylladb#24960 * github.com:scylladb/scylladb: test: cdc: test cdc compatible schema cdc: use compatiable cdc schema db: schema_applier: create schema with pointer to CDC schema db: schema_applier: extract cdc tables schema: add pointer to CDC schema schema_registry: remove base_info from global_schema_ptr schema_registry: use extended_frozen_schema in schema load schema_registry: replace frozen_schema+base_info with extended_frozen_schema frozen_schema: extract info from schema_ptr in the constructor frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema	2025-11-13 10:11:54 +01:00
Pavel Emelyanov	f47f2db710	Merge 'Support local primary-replica-only for native restore' from Robert Bindar This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with: - `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only - `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only. - `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself) - `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense. The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope. Fixes #26584 Closes scylladb/scylladb#26609 * github.com:scylladb/scylladb: Add cluster tests for checking scoped primary_replica_only streaming Improve choice distribution for primary replica Refactor cluster/object_store/test_backup nodetool restore: add primary-replica-only option nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only Enable scoped primary replica only streaming Support primary_replica_only for native restore API	2025-11-13 12:11:18 +03:00
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of contained sstables. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint_64 bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
Tomasz Grabiec	10b893dc27	Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili `topology_coordinator::migrate_tablet_size()` was introduced in `10f07fb95a`. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search: ``` if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) { if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) { ``` This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats. This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method. A version containing this bug has not been released yet, so no backport is needed. Closes scylladb/scylladb#26946 * github.com:scylladb/scylladb: load_stats: add test for migrate_tablet_size() load_stats: fix problem with tablet size migration	2025-11-12 23:48:37 +01:00
Nadav Har'El	5839574294	Merge 'cql3: Fix std::bad_cast when deserializing vectors of collections' from Karol Nowacki cql3: Fix std::bad_cast when deserializing vectors of collections This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`. The cause was "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference. This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct. To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable. Fixes: #26704 This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`. Closes scylladb/scylladb#26740 * github.com:scylladb/scylladb: cql3: Make abstract_type explicitly noncopyable cql3: Fix std::bad_cast when deserializing vectors of collections	2025-11-13 00:24:25 +02:00
Petr Gusev	9fed80c4be	messaging_service: simplify and optimize ban_host We do one cross-shard call for all left+ignored nodes.	2025-11-12 12:27:44 +01:00
Petr Gusev	52cccc999e	storage_service: topology_state_load: extract topology variable It's inconvenient to always write the long expression _topology_state_machine._topology.	2025-11-12 12:27:44 +01:00
Petr Gusev	66063f202b	topology_coordinator: excluded_tablet_nodes -> ignored_nodes ignored_nodes is sufficient in these cases. excluded_tablet_nodes also includes left_nodes_rs, which are not needed here — global_token_metadata_barrier runs the barrier only on normal and transition nodes, not on left nodes.	2025-11-12 12:27:44 +01:00
Petr Gusev	82da83d0e5	topology_state_machine: add excluded_tablet_nodes field The topology_coordinator::is_excluded() creates a temporary hash map for each call. This is probably not a performance problem since left_nodes_rs contains only those left nodes that are referenced from tablet replicas, this happens temporarily while e.g. a replaced node is being rebuilt. On the other hand, why not just have a dedicated field in the topology_state_machine, then this code wouldn't look suspicious.	2025-11-12 12:27:43 +01:00
Gleb Natapov	e872f9cb4e	cleanup: Add RESTful API to allow reset cleanup needed flag Cleaning up a node using per keyspace/table interface does not reset cleanup needed flag in the topology. The assumption was that running cleanup on already clean node does nothing and completes quickly. But due to https://github.com/scylladb/scylladb/issues/12215 (which is closed as WONTFIX) this is not the case. This patch provides the ability to reset the flag in the topology if operator cleaned up the node manually already.	2025-11-12 10:56:57 +02:00
Nadav Har'El	4de88a7fdc	test/cqlpy: fix run script for materialized views on tablets Recently we enabled tablets by default, but it is necessary to enable rf_rack_valid_keyspaces if materialized views are to be used with tablets, and this option is not the default. We did add this option in test/pylib/scylla_cluster.py which is used by test.py, but we didn't add it to test/cqlpy/run.py, so the test/cqlpy/run script is no longer able to run tests with materialized views. So this patch adds the missing configuration to run.py. Fixes #26918 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26919	2025-11-12 11:56:21 +03:00
Karol Nowacki	77da4517d2	cql3: Make abstract_type explicitly noncopyable The polymorphic abstract_type class serves as an interface and should not be copied. To prevent accidental and unsafe copies, make it explicitly uncopyable.	2025-11-12 09:11:56 +01:00
Karol Nowacki	960fe3da60	cql3: Fix std::bad_cast when deserializing vectors of collections When deserializing a vector whose elements are collections (e.g., set, list), the operation raises a `std::bad_cast` exception. This was caused by type slicing due to an incorrect assignment of a polymorphic type by value instead of by reference. This resulted in a failed `dynamic_cast` even when the underlying type was correct.	2025-11-12 09:11:56 +01:00
Botond Dénes	6f6ee5581e	Merge 'encryption::kms_host: Add exponential backoff-retry for 503 errors' from Calle Wilund Refs #26822 AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff. Note: we do _not_ retry forever. Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe. Closes scylladb/scylladb#26934 * github.com:scylladb/scylladb: encryption::kms_host: Add exponential backoff-retry for 503 errors encryption::kms_host: Include http error code in kms_error	2025-11-12 08:33:33 +02:00
Yaron Kaikov	3ade3d8f5b	auto-backport: Add support for JIRA issue references - Added support for JIRA issue references in PR body and commit messages - Supports both short format (PKG-92) and full URL format - Maintains existing GitHub issue reference support - JIRA pattern matches https://scylladb.atlassian.net/browse/{PROJECT-ID} - Allows backporting for PRs that reference JIRA issues with 'fixes' keyword Fixes: https://github.com/scylladb/scylladb/issues/26955 Closes scylladb/scylladb#26954	2025-11-12 08:15:06 +02:00
Calle Wilund	d22e0acf0b	encryption::kms_host: Add exponential backoff-retry for 503 errors Refs #26822 AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff. Note: we do _not_ retry forever. Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe. v2: * Use utils::exponential_backoff_retry	2025-11-11 21:02:32 +00:00
Calle Wilund	190e3666cb	encryption::kms_host: Include http error code in kms_error Keep track of actual HTTP failure.	2025-11-11 21:02:32 +00:00
Ferenc Szili	fcbc239413	load_stats: add test for migrate_tablet_size() This change adds tests which validate the functionality of load_stats::migrate_tablet_size()	2025-11-11 14:28:31 +01:00
Ferenc Szili	b77ea1b8e1	load_stats: fix problem with tablet size migration This patch fixes a bug with tablet size migration in load_stats. has_tablet_size() lambda in topology_coordinator::migrate_tablet_size() was returning false in all cases due to incorrect search iterator comparison after a table and tablet search. This change moves load_stats migrate_tablet_sizes() functionality into a separate method of load_stats.	2025-11-11 14:26:09 +01:00
Yehuda Lebi	a05ebbbfbb	dist/docker: add configurable blocked-reactor-notify-ms parameter Add --blocked-reactor-notify-ms argument to allow overriding the default blocked reactor notification timeout value of 25 ms. This change provides users the flexibility to customize the reactor notification timeout as needed. Fixes: scylladb/scylla-enterprise#5525 Closes scylladb/scylladb#26892	2025-11-11 12:38:40 +02:00
Benny Halevy	a290505239	utils: stall_free: add dispose_gently dispose_gently consumes the object moved to it, clearing it gently before it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26356	2025-11-11 12:20:18 +02:00
Yaron Kaikov	c601371b57	install-dependencies.sh: update node_exporter to 1.10.2 Update node exporter to solve CVE-2025-22871 [regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5 Closes scylladb/scylladb#26916	2025-11-11 11:36:13 +02:00
Nadav Har'El	b659dfcbe9	test/cqlpy: comment out Cassandra check that is no longer relevant In the test translated from Cassandra validation/operations/alter_test.py we had two lines in the beginning of an unrelated test that verified that CREATE KEYSPACE is not allowed without replication parameters. But starting recently, ScyllaDB does have defaults and does allow these CREATE KEYSPACE. So comment out these two test lines. We didn't notice that this test started to fail, because it was already marked xfail, because in the main part of this test, it reproduces a different issue! The annoying side effect of these no-longer-passing checks was that because the test expected a CREATE KEYSPACE to fail, it didn't bother to delete this keyspace when it finished, which causes test.py to report that there's a problem because some keyspaces still exist at the end of the test. Now that we fixed this problem, we no longer need to list this test in test/cqlpy/suite.yaml as a test that leaves behind undeleted keyspaces. Fixes #26292 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26341	2025-11-11 10:34:27 +02:00
Nikos Dragazis	56e5dfc14b	migration_manager: Add missing validations for schema extensions The migration manager offers some free functions to prepare mutations for a new/updated table/view. Most of them include a validation check for the schema extensions, but in the following ones it's missing: * `prepare_new_column_family_announcement` (overload with vector as out parameter) * `prepare_new_column_families_announcement` Presumably, this was just an omission. It's also not a very important one since the only extension having validation logic is the `encryption_schema_extension`, but none of these functions is connected to user queries where encryption options can be provided in the schema. User queries go through the other `prepare_new_column_family_announcement` overload, which does perform a validation check. Add validation in the missing places. Fixes #26470. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#26487	2025-11-11 10:08:58 +02:00
Botond Dénes	042303f0c9	Merge 'Alternator: enable tablets by default - depending on tablets_mode_for_new_keyspaces' from Nadav Har'El Before this series, Alternator's CreateTable operation defaults to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are now fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables. We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table. As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes. Fixes https://github.com/scylladb/scylladb/issues/22463 Backport to 2025.4, it's important not to delay phasing out vnodes. 
Closes scylladb/scylladb#26836 * github.com:scylladb/scylladb: test,alternator: use 3-rack clusters in tests alternator: improve error in tablets_mode_for_new_keyspaces=enforced config: make tablets_mode_for_new_keyspaces live-updatable alternator: improve comment about non-hidden system tags alternator: Fix test_ttl_expiration_streams() alternator: Fix test_scan_paging_missing_limit() alternator: Don't require vnodes for TTL tests alternator: Remove obsolete test from test_table.py alternator: Fix tag name to request vnodes alternator: Fix test name clash in test_tablets.py alternator: test_tablets.py handles new policy reg. tablets alternator: Update doc regarding tablets support alternator: Support `tablets_mode_for_new_keyspaces` config flag Fix incorrect hint for tablets_mode_for_new_keyspaces Fix comment for tablets_mode_for_new_keyspaces	2025-11-11 09:45:29 +02:00
Avi Kivity	bae2654b34	tools: dbuild: avoid `test -v` incompatibility with MacOS shell `test -v` isn't present on the MacOS shell. Since dbuild is intended as a compatibility bridge between the host environment and the build environment, don't use it there. Use ${var+text_if_set} expansion as a workaround. Fixes #26937 Closes scylladb/scylladb#26939	2025-11-11 09:43:14 +02:00
Nikos Dragazis	94c4f651ca	test/cqlpy: Test secondary index with short reads Add a test to check that paged secondary index queries behave correctly when pages are short. This is currently failing in Scylla, but passes in Cassandra 5, therefore marked as "xfailing". Refer to the test's docstring for more details. The bug is a regression introduced by commit `f6f18b1`. `test/cqlpy/run --release ...` shows that the test passes in 5.1 but fails in 5.2 onwards. Refs #25839. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#25843	2025-11-11 09:28:45 +02:00
Robert Bindar	a04ebb829c	Add cluster tests for checking scoped primary_replica_only streaming This commit adds tests checking various scenarios of restoring via load and stream with primary_replica_only and a scope specified. The tests check that, in a few topologies, a mutation is replicated the correct number of times given primary_replica_only and that streaming happens according to the scope rule passed. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	817fdadd49	Improve choice distribution for primary replica I noticed during tests that `maybe_get_primary_replica` would not distribute the choice of primary replica uniformly, because `info.replicas` would be ordered one way on some shards and differently on others, thus making the function choose the same node as primary replica multiple times when it clearly could have chosen a different node. This patch sorts the replica set before passing it through the scope filter. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	d4e43bd34c	Refactor cluster/object_store/test_backup This PR splits the support code from test_backup.py into multiple functions so less duplicated code is produced by new tests using it. It also makes it a bit easier to understand. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	c1b3fe30be	nodetool restore: add primary-replica-only option Add --primary-replica-only and update docs page for nodetool restore. The relationship with the scope parameter is: - scope=all primary_replica_only=true gets the global primary replica - scope=dc primary_replica_only=true gets the local primary replica - scope=rack primary_replica_only=true is like a noop, it gets the only replica in the rack (rf=#racks) - scope=node primary_replica_only=true is not allowed Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	83aee954b4	nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only So far it was not allowed to pass a scope when using the primary_replica_only option. This patch enables it because the concepts are now combined so that: - scope=all primary_replica_only=true gets the global primary replica - scope=dc primary_replica_only=true gets the local primary replica - scope=rack primary_replica_only=true is like a noop, it gets the only replica in the rack (rf=#racks) - scope=node primary_replica_only=true is not allowed Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	136b45d657	Enable scoped primary replica only streaming This patch removes the restriction for streaming to primary replica only within a scope. Node scope streaming to primary replica is disallowed. Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	965a16ce6f	Support primary_replica_only for native restore API Current native restore does not support primary_replica_only, it is hard-coded disabled and this may lead to data amplification issues. This patch extends the restore REST API to accept a primary_replica_only parameter and propagates it to sstables_loader so it gets correctly passed to load_and_stream. Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:17:52 +02:00
Dawid Mędrek	394207fd69	test: Disable maintenance mode correctly in test_maintenance_mode.py Although setting the value of `maintenance_mode` to the string `"false"` disables maintenance mode, the testing framework misinterprets the value and thinks that it's actually enabled. As a result, it might try to connect to Scylla via the maintenance socket, which we don't want.	2025-11-10 19:22:06 +01:00
Dawid Mędrek	222eab45f8	test: Fix keyspace in test_maintenance_mode.py The keyspace used in the test is not necessarily called `ks`.	2025-11-10 19:21:58 +01:00
Dawid Mędrek	c0f7622d12	service/qos: Do not crash Scylla if auth_integration absent If the user connects to Scylla via the maintenance socket, it may happen that `auth_integration` has not been registered in the service level controller yet. One example is maintenance mode when that will never happen; another when the connection occurs before Scylla is fully initialized. To avoid unnecessary crashes, we add new branches if the passed user is absent or if it corresponds to the anonymous role. Since the role corresponding to a connection via the maintenance socket is the anonymous role, that solves the problem. In those cases, we completely circumvent any calls to `auth_integration` and handle them separately. The modified methods are: * `get_user_scheduling_group`, * `with_user_service_level`, * `describe_service_levels`. For the first two, the new behavior is in line with the previous implementation of those functions. The last behaves differently now, but since it's a soft error, crashing the node is not necessary anyway. We throw an exception instead, whose error message should give the user a hint of what might be wrong. The other uses of `auth_integration` within the service level controller are not problematic: * `find_effective_service_level`, * `find_cached_effective_service_level`. They take the name of a role as their argument. Since the anonymous role doesn't have a name, it's not possible to call them with it. Fixes scylladb/scylladb#26816	2025-11-10 19:21:36 +01:00
Yaron Kaikov	850ec2c2b0	Trigger scylla-ci directly from PR instead of scylla-ci-route job Refactoring scylla-ci to be triggered directly from each PR using a GitHub action. This will allow us to skip triggering CI when only the PR commit message was updated (which will save us unneeded CI runs). We can also remove the `Scylla-CI-route` pipeline, which routes each PR to the proper CI job under the release (the GitHub action will do it automatically), to reduce complexity. Fixes: https://scylladb.atlassian.net/browse/PKG-69 Closes scylladb/scylladb#26799	2025-11-10 15:10:11 +02:00
Pavel Emelyanov	decf86b146	Merge 'Make AWS & Azure KMS boost testing use fixture + include Azure in pytests' from Calle Wilund * Adds test fixture for AWS KMS * Adds test fixture for Azure KMS * Adds key provider proxy for Azure to pytests (ported dtests) * Make test gather for boost tests handle suites * Fix GCP test snafu Fixes #26781 Fixes #26780 Fixes #26776 Fixes #26775 Closes scylladb/scylladb#26785 * github.com:scylladb/scylladb: gcp_object_storage_test: Re-enable parallelism. test::pylib: Add azure (mock) testing to EAR matrix test::boost::encryption_at_rest: Remove redundant azure test indent test::boost::encryption_at_rest: Move azure tests to use fixture test::lib: Add azure mock/real server fixture test::pylib::boost: Fix test gather to handle test suites utils::gcp::object_storage: Fix typo in semaphore init test::boost::encryption_at_rest_test: Remove redundant indent test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test test::boost::test_encryption_at_rest: Reorder tests and helpers ent::encryption: Make text helper routines take std::string test::pylib::dockerized_service: Handle docker/podman bind error message test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS test::lib::gcs_fixture: Only set port if running docker image + more retry	2025-11-10 14:35:05 +03:00
Michał Jadwiszczak	9345c33d27	service/storage_service: migrate staging sstables in view building worker during intra-node migration Use methods introduced in the previous commit and: - load staging sstables to the view building worker on the target shard, at the end of `streaming` stage - clear migrated staging sstables on source shard in `cleanup` stage This patch also removes skip mark in `test_staging_sstables_with_tablet_merge`. Fixes scylladb/scylladb#26244	2025-11-10 10:38:08 +01:00
Michał Jadwiszczak	4bc6361766	db/view/view_building_worker: support sstables intra-node migration We need to be able to load sstables on the target shard during intra-node tablet migration and to cleanup migrated sstables on the source shard.	2025-11-10 10:36:32 +01:00
Michał Jadwiszczak	c99231c4c2	db/view_building_worker: fix indent	2025-11-10 09:02:16 +01:00
Michał Jadwiszczak	2e8c096930	db/view/view_building_worker: don't organize staging sstables by last token There was a problem with staging sstables after tablet merge. Let's say there were 2 tablets and tablet 1 (lower last token) had a staging sstable. Then a tablet merge occurred, so there is only one tablet now (higher last token). But entries in `_staging_sstables`, which are grouped by last token, are never adjusted. Since there shouldn't be thousands of sstables, we can just hold a list of sstables per table and filter necessary entries when doing `process_staging` view building task.	2025-11-10 09:02:16 +01:00
Nadav Har'El	35f3a8d7db	docs/alternator: fix small mistake in compatibility.md docs/alternator/compatibility.md describes support for global (multi-DC) tables, and suggests that the CQL command "ALTER TABLE" should be used to change the replication of an Alternator table. But actually, the right command is "ALTER KEYSPACE", not "ALTER TABLE". So fix the document. Fixes #26737 Closes scylladb/scylladb#26872	2025-11-10 08:48:18 +03:00
Yauheni Khatsianevich	d3e62b15db	fix(test): minor typo fix, removing redundant param from logging Closes scylladb/scylladb#26901	2025-11-10 08:42:11 +03:00
Dario Mirovic	d364904ebe	test: dtest: audit_test.py: add AuditBackendComposite Add `AuditBackendComposite`, a test class which allows testing multiple audit outputs in a single run, implemented in `audit_composite_storage_helper` class. Add two more tests. `test_composite_audit_type_invalid` tests if an invalid audit mode among correct ones causes the same error as when it is the only specified audit mode. `test_composite_audit_empty_settings` tests if `'none'` audit mode, when specified along other audit modes, properly disables audit logging. Refs #26022	2025-11-10 00:31:34 +01:00
Dario Mirovic	a8ed607440	test: dtest: audit_test.py: group logs in dict per audit mode Before this patch audit test could process audit logs from a single audit output. This patch adds support for multiple audit outputs in the same run. The change is needed in order to test `audit_composite_storage_helper`, which can write to multiple audit outputs. Refs #26022	2025-11-10 00:31:34 +01:00
Dario Mirovic	afca230890	audit: write out to both table and syslog This patch adds support for multiple audit log outputs. If only one audit log output is enabled, the behavior does not change. If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection of `storage_helper` objects. Fixes #26022	2025-11-10 00:31:30 +01:00
Dario Mirovic	7ec9e23ee3	test: cqlpy: add test case for non-numeric PERCENTILE value Add test case for non-numeric PERCENTILE value, which raises an error different to the out-of-range invalid values. Regex in the test test_invalid_percentile_speculative_retry_values is expanded. Refs #26369	2025-11-09 13:59:36 +01:00
Dario Mirovic	85f059c148	schema: speculative_retry: update exception type for sstring ops Change speculative_retry::to_sstring and speculative_retry::from_sstring to throw exceptions::configuration_exception instead of std::invalid_argument. These errors can be triggered by CQL, so appropriate CQL exception should be used. Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304 Refs #26369	2025-11-09 13:55:57 +01:00
Dario Mirovic	aba4c006ba	docs: cql: ddl.rst: update speculative-retry-options Clarify how the value of `XPERCENTILE` is handled: - Values 0 and 100 are supported - The percentile value is rounded to the nearest 0.1 (1 decimal place) Refs #26369	2025-11-09 13:23:29 +01:00
Dario Mirovic	5d1913a502	test: cqlpy: add test for valid speculative_retry values test_valid_percentile_speculative_retry_values is introduced to test that valid values for speculative_retry are properly accepted. Some of the values are moved from the test_invalid_percentile_speculative_retry_values test, because the previous commit added support for them. Refs #26369	2025-11-09 13:23:26 +01:00
Dario Mirovic	da2ac90bb6	schema: speculative_retry: allow 0 and 100 PERCENTILE values This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry. It was possible to specify these values before #21825. #21825 prevented specifying invalid values, like -1 and 101, but also prevented using 0 and 100. On top of that, speculative_retry::to_sstring function did rounding when formatting the string, which introduced inconsistency. Fixes #26369	2025-11-09 12:26:27 +01:00
Nadav Har'El	65ed678109	test,alternator: use 3-rack clusters in tests With tablets enabled, we can't create an Alternator table on a three-node cluster with a single rack, since Scylla refuses RF=3 with just one rack and we get the error: An error occurred (InternalServerError) when calling the CreateTable operation: ... Replication factor 3 exceeds the number of racks (1) in dc datacenter1 So in test/cluster/test_alternator.py we need to use the incantation "auto_rack_dc='dc1'" every time that we create a three-node cluster. Before this patch, several tests in test/cluster/test_alternator.py failed on this error, with this patch all of them pass. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Nadav Har'El	c03081eb12	alternator: improve error in tablets_mode_for_new_keyspaces=enforced When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is supposed to fail when CreateTable asks explicitly for vnodes. Before this patch, this error was an ugly "Internal Server Error" (an exception thrown from deep inside the implementation), this patch checks for this case in the right place, to generate a proper ValidationException with a proper error message. We also enable the test test_tablets_tag_vs_config which should have caught this error, but didn't because it was marked xfail because tablets_mode_for_new_keyspaces had not been live-updatable. Now that it is, we can enable the test. I also improved the test to be slightly faster (no need to change the configuration so many times) and also check the ordinary case - where the schema chooses neither vnodes nor tablets explicitly and we should just use the default. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Nadav Har'El	25439127c8	config: make tablets_mode_for_new_keyspaces live-updatable We have a configuration option "tablets_mode_for_new_keyspaces" which determines whether new keyspaces should use tablets or vnodes. For some reason, this configuration parameter was never marked live-updatable, so in this patch we add the flag. No other changes are needed - the existing code that uses this flag always uses it through the up-to-date configuration. In the previous patches we start to honor tablets_mode_for_new_keyspaces also in Alternator CreateTable, and we wanted to test this but couldn't do this in test/alternator because the option was not live-updatable. Now that it will be, we'll be able to test this feature in test/alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Nadav Har'El	b34f28dae2	alternator: improve comment about non-hidden system tags The previous patches added a somewhat misleading comment in front of system:initial_tablets, which this patch improves. That tag is NOT where Alternator "stores" table properties like the existing comment claimed. In fact, the whole point is that it's the opposite - Alternator never writes to this tag - it's a user-writable tag which Alternator reads, to configure the new table. And this is why it obviously can't be hidden from the user. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	eeb3a40afb	alternator: Fix test_ttl_expiration_streams() The test is now aware of the new name of the `system:initial_tablets` tag.	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	a659698c6d	alternator: Fix test_scan_paging_missing_limit() With tablets, the test began failing. The failure was correlated with the number of initial tablets, which when kept at default, equals 4 tablets per shard in release build and 2 tablets per shard in dev build. In this patch we split the test into two - one with more data in the table to check the original purpose of this test - that Scan doesn't return the entire table in one page if "Limit" is missing. The other test reproduces issue #10327 - that when the table is small, Scan's page size isn't strictly limited to 1MB as it is in DynamoDB. Experimentally, 8000 KB of data (compared to 6000 KB before this patch) is enough when we have up to 4 initial tablets per shard (so 8 initial tablets on a two-shard node as we typically run in tests). Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com> modified by Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	345747775b	alternator: Don't require vnodes for TTL tests Since #23662 Alternator supports TTL with tablets too. Let's clear some leftovers causing Alternator to test TTL with vnodes instead of with what is default for Alternator (tablets or vnodes).	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	274d0b6d62	alternator: Remove obsolete test from test_table.py Since Alternator is capable of running with tablets according to the flag in config, remove the obsolete test that is making sure that Alternator runs with vnodes.	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	63897370cb	alternator: Fix tag name to request vnodes The tag was recently renamed from `experimental:initial_tablets` to `system:initial_tablets`. This commit fixes both the tests and the exceptions sent to the user instructing how to create a table with vnodes.	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	c7de7e76f4	alternator: Fix test name clash in test_tablets.py	2025-11-09 12:52:28 +02:00
Piotr Szymaniak	7466325028	alternator: test_tablets.py handles new policy reg. tablets Adjust the tests so they are in line with the config flag `tablets_mode_for_new_keyspaces` that Alternator learned to honour.	2025-11-09 12:52:28 +02:00
Piotr Szymaniak	35216d2f01	alternator: Update doc regarding tablets support Reflect honouring by Alternator the value of the config flag `tablets_mode_for_new_keyspaces`, as well as renaming of the tag `experimental:initial_tablets` into `system:initial_tablets`.	2025-11-09 12:52:28 +02:00
Piotr Szymaniak	376a2f2109	alternator: Support `tablets_mode_for_new_keyspaces` config flag Until now, tablets in Alternator were an experimental feature enabled only when a TAG "experimental:initial_tablets" was present when creating a table and associated with a numeric value. After this patch, Alternator honours the value of `tablets_mode_for_new_keyspaces` config flag. Each table can be overridden to use tablets or not by supplying a new TAG "system:initial_tablets". The rules stay the same as with the earlier, experimental tag: when supplied with a numeric value, the table will use tablets (as long as they are supported). When supplied with something else (like a string "none"), the table will use vnodes, provided that tablets are not `enforced` by the config flag. Fixes #22463	2025-11-09 12:52:17 +02:00
Piotr Szymaniak	af00b59930	Fix incorrect hint for tablets_mode_for_new_keyspaces	2025-11-09 10:49:46 +02:00
Piotr Szymaniak	403068cb3d	Fix comment for tablets_mode_for_new_keyspaces The comment was not listing all the 3 possible values correctly, despite an explanation just below covers all 3 values.	2025-11-09 10:49:46 +02:00
Botond Dénes	cdba3bebda	Merge 'Generalize directory checks in database_test's snapshot test cases' from Pavel Emelyanov Those test cases use lister::scan_dir() to validate the contents of snapshot directory of a table against this table's base directory. This PR generalizes the listing code making it shorter. Also, the snapshot_skip_flush_works case is missing the check for "schema.cql" file. Nothing is wrong with it, but the test is more accurate if checking it. Also, the snapshot_with_quarantine_works case tries to check if one set of names is sub-set of another using lengthy code. Using std::includes improves the test readability a lot. Also, the PR replaces lister::scan_dir() with directory_lister. The former is going to be removed some day (see also #26586) Improving existing working test, no backport is needed. Closes scylladb/scylladb#26693 * github.com:scylladb/scylladb: database_test: Simplify snapshot_with_quarantine_works() test database_test: Improve snapshot_skip_flush_works test database_test: Simplify snapshot_works() tests database_test: Use collect_files() to remove files database_test: Use collect_files() to count files in directory database_test: Introduce collect_files() helper	2025-11-07 16:04:02 +02:00
Michał Chojnowski	b82c2aec96	sstables/trie: fix an assertion violation in bti_partition_index_writer_impl::write_last_key _last_key is a multi-fragment buffer. Some prefix of _last_key (up to _last_key_mismatch) is unneeded because it's already a part of the trie. Some suffix of _last_key (after needed_prefix) is unneeded because _last_key can be differentiated from its neighbors even without it. The job of write_last_key() is to find the middle fragments, (containing the range `[_last_key_mismatch, needed_prefix)`) trim the first and last of the middle fragments appropriately, and feed them to the trie writer. But there's an error in the current logic, in the case where `_last_key_mismatch` falls on a fragment boundary. To describe it with an example, if the key is fragmented like `aaa\|bbb\|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`, then the intended output to the trie writer is `bbb\|c`, but the actual output is `\|bbb\|c`. (I.e. the first fragment is empty). Technically the trie writer could handle empty fragments, but it has an assertion against them, because they are a questionable thing. Fix that. We also extend bti_index_test so that it's able to hit the assert violation (before the patch). The reason why it wasn't able to do that before the patch is that the violation requires decorated keys to differ on the _first_ byte of a partition key column, but the keys generated by the test only differed on the last byte of the column. (Because the test was using sequential integers to make the values more human-readable during debugging). So we modify the key generation to use random values that can differ on any position. Fixes scylladb/scylladb#26819 Closes scylladb/scylladb#26839	2025-11-07 11:25:07 +02:00
Abhinav Jha	ab0e0eab90	raft topology: skip non-idempotent steps in decommission path to avoid problems during races In the present scenario, there are issues in left_token_ring transition state execution in the decommissioning path. In case of concurrent mutation race conditions, we enter left_token_ring more than once, and apparently if we enter left_token_ring a second time, we try to barrier the decommissioned node, which at this point is no longer possible. That's what causes the errors. This PR resolves the issue by adding a check right at the start of left_token_ring to check if the first topology state update, which marks the request as done, is completed. In this case, it's confirmed that this is the second time the flow is entering left_token_ring and the steps preceding the request status update should be skipped. In such cases, all the remaining steps are skipped and the topology node status update (which threw an error in the previous attempt) is executed directly. Node removal status from group0 is also checked and the remove operation is retried if it failed last time. Although these changes are done with regard to the decommission operation behavior in `left_token_ring` transition state, since the PR doesn't interfere with the core logic, it should not derail any rollback specific logic. The changes just prevent some non-idempotent operations from recurring in case of failures. The rest of the core logic remains intact. A test is also added to confirm the proper working of the same. Fixes: scylladb/scylladb#20865 Backport is not needed, since this is not a super critical bug fix. Closes scylladb/scylladb#26717	2025-11-07 10:07:49 +01:00
Ran Regev	aaf53e9c42	nodetool refresh primary-replica-only Fixes: #26440 1. Added description to primary-replica-only option 2. Fixed code text to better reflect the constraints checked in the code itself, namely: that both primary-replica-only and scope must be applied only if load and stream is applied too, and that they are mutually exclusive. Note: when https://github.com/scylladb/scylladb/issues/26584 is implemented (with #26609) there will be a need to align the docs as well - namely, primary-replica-only and scope will no longer be mutually exclusive Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#26480	2025-11-07 10:59:27 +02:00
Avi Kivity	245173cc33	tools: toolchain: optimized_clang: remove unused variable CLANG_SUFFIX The variable was unused since `cae999c094` ("toolchain: change optimized clang install method to standard one"), and now causes the differential shellcheck continuous integration test to fail whenever it is changed. Remove it. Closes scylladb/scylladb#26796	2025-11-07 10:08:23 +02:00
Patryk Jędrzejczak	d6c64097ad	Merge 'storage_proxy: use gates to track write handlers destruction' from Petr Gusev In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for `abstract_write_response_handler` instances destruction. We strived to minimize the memory footprint of `abstract_write_response_handler`, with `write_handler_destroy_promise`-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, in such cases the vector can become big and cause 'oversized allocation' seastar warnings. Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103). In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector` to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`. Another one can be attached by a caller of `cancel_write_handlers`. Nothing stops several cancel_write_handlers from being called at the same time, but it should be rare. The `sizeof(utils::small_vector) == 40`, this is `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable. Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788) backport: need to backport to 2025.4 (LWT for tablets release) Closes scylladb/scylladb#26827 * https://github.com/scylladb/scylladb: storage_proxy: use coroutine::maybe_yield(); storage_proxy: use gates to track write handlers destruction	2025-11-06 10:17:04 +01:00
Nadav Har'El	b8da623574	Update tools/cqlsh submodule * tools/cqlsh f852b1f5...19445a5c (2): > Update scylla-driver version to 3.29.4 Update tools/cqlsh submodule for scylla-driver 3.29.4 The motivation for this update is to resolve a driver-side serialization bug that was blocking work on #26740. The bug affected vector<collection> types (e.g., vector<set<int>,1>) and is fixed in scylla-driver versions 3.29.2+. Refs #26704	2025-11-06 10:01:26 +02:00
Asias He	dbeca7c14d	repair: Add metric for time spent on tablet repair It is useful to check time spent on tablet repair. It can be used to compare incremental repair and non-incremental repair. The time does not include the time waiting for the tablet scheduler to schedule the tablet repair task. Fixes #26505 Closes scylladb/scylladb#26502	2025-11-06 10:00:20 +03:00
Dario Mirovic	c3a673d37f	audit: move storage helper creation from `audit::start` to `audit::audit` Extract storage helper creation into `create_storage_helper` function. Call this function from `audit::audit`. It will be called per shard inside `sharded<audit>::start` method. Refs #26022	2025-11-06 03:05:43 +01:00
Dario Mirovic	28c1c0f78d	audit: fix formatting in `audit::start_audit` Refs #26022	2025-11-06 03:05:17 +01:00
Dario Mirovic	549e6307ec	audit: unify `create_audit` and `start_audit` There is no need to have `create_audit` separate from `start_audit`. `create_audit` just stores the passed parameters, while `start_audit` does the actual initialization and startup work. Refs #26022	2025-11-06 03:05:06 +01:00
Calle Wilund	b0061e8c6a	gcp_object_storage_test: Re-enable parallelism. Re-enable parallel execution to get better logs. Note, this is somewhat wasteful, as we won't re-use test fixture here, but in the end, it is probably an improvement.	2025-11-05 15:07:26 +00:00
Wojciech Mitros	0a22ac3c9e	mv: don't mark the view as built if the reader produced no partitions When we build a materialized view we read the entire base table from start to end to generate all required view updates. If a view is created while another view is being built on the same base table, this is optimized - we start generating view updates for the new view from the base table rows that we're currently reading, and we read the missed initial range again after the previous view finishes building. The view building progress is only updated after generating view updates for some read partitions. However, there are scenarios where we'll generate no view updates for the entire read range. If this was not handled we could end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293 To handle this, we mark the view as built if the reader generated no partitions. However, this is not always the correct conclusion. Another scenario where the reader won't encounter any partitions is when view building is interrupted, and then we perform a reshard. In this scenario, we set the reader for all shards to the last unbuilt token for an existing partition before the reshard. However, this partition may not exist on a shard after reshard, and if there are also no partitions with higher tokens, the reader will generate no partitions even though it hasn't finished view building. Additionally, we already have a check that prevents infinite view building loops without taking the partitions generated by the reader into account. At the end of stream, before looping back to the start, we advance current_key to the end of the built range and check for built views in that range. This handles the case where the entire range is empty - the conditions for a built view are: 1. the "next_token" is no greater than "first_token" (the view building process looped back, so we've built all tokens above "first_token") 2. the "current_token" is no less than "first_token" (after looping back, we've built all tokens below "first_token") If the range is empty, we'll pass these conditions on an empty range after advancing "current_key" to the end because: 1. after looping back, "next_token" will be set to `dht::minimum_token` 2. "current_key" will be set to `dht::ring_position::max()` In this patch we remove the check for partitions generated by the reader. This fixes the issue with resharding and it does not resurrect the issue with infinite view building that the check was introduced for. Fixes https://github.com/scylladb/scylladb/issues/26523 Closes scylladb/scylladb#26635	2025-11-05 17:02:32 +02:00
Petr Gusev	5bda226ff6	storage_proxy: use coroutine::maybe_yield(); This is a small "while at it" refactoring -- better to use coroutine::maybe_yield with co_await-s.	2025-11-05 14:38:19 +01:00
Petr Gusev	4578304b76	storage_proxy: use gates to track write handlers destruction In #26408 a write_handler_destroy_promise class was introduced to wait for abstract_write_response_handler instances destruction. We strived to minimize the memory footprint of abstract_write_response_handler, with write_handler_destroy_promise-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, in such cases the vector<write_handler_destroy_promise> can become big and cause 'oversized allocation' seastar warnings. Another concern with write_handler_destroy_promise-es was that they were more complicated than it was worth. In this commit we replace write_handler_destroy_promise with simple gates. One or more gates can be attached to an abstract_write_response_handler to wait for its destruction. We use utils::small_vector<gate::holder, 2> to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is storage_proxy::_write_handlers_gate, which is used to wait for all handlers in cancel_all_write_response_handlers. Another one can be attached by a caller of cancel_write_handlers. Nothing stops several cancel_write_handlers from being called at the same time, but it should be rare. The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is 40.0 / 488 * 100 ~ 8% increase in sizeof(abstract_write_response_handler), which seems acceptable. Fixes scylladb/scylladb#26788	2025-11-05 14:37:52 +01:00
Nadav Har'El	8a07b41ae4	test/cqlpy: add test confirming page_size=0 disables paging In pull request #26384 a discussion started whether page_size=0 really disables paging, or maybe one needs page_size=-1 to truly disable paging. The reason for that discussion was commit `08c81427b` that started to use page_size=-1 for internal unpaged queries, and commit `76b31a3` that incorrectly claimed that page_size>=0 means paging is enabled. This patch introduces a test that confirms that with page_size=0, paging is truly disabled - including the size-based (1MB) paging. The new test is Scylla-only, because Cassandra is anyway missing the size-based page cutoff (see CASSANDRA-11745). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26742	2025-11-05 15:52:16 +03:00
Tomasz Grabiec	f8879d797d	tablet_allocator: Avoid load balancer failure when replacing the last node in a rack Introduced in `9ebdeb2` The problem is specific to node replacing and rack-list RF. The culprit is in the part of the load balancer which determines rack's shard count. If we're replacing the last node, the rack will contain no normal nodes, and shards_per_rack will have no entry for the rack, on which the table still has replicas. This throws std::out_of_range and fails the tablet draining stage, and node replace is failed. No backport because the problem exists only on master. Fixes #26768 Closes scylladb/scylladb#26783	2025-11-05 15:49:51 +03:00
Avi Kivity	8e480110c2	dist: housekeeping: set python.multiprocessing fork mode to "fork" Python 3.14 changed the multiprocessing fork mode to "forkserver", presumably for good reasons. However, it conflicts with our relocatable Python system. "forkserver" forks and execs a Python process at startup, but it does this without supplying our relocated ld.so. The system ld.so detects a conflict and crashes. Fix this by switching back to "fork", which is sufficient for housekeeping's modest needs. Closes scylladb/scylladb#26831	2025-11-05 15:47:38 +03:00
Pavel Emelyanov	05d711f221	database_test: Simplify snapshot_with_quarantine_works() test The test collects Data files from table dir, then _all_ files from snapshot dir and then checks whether the former is the subset of the latter. Using std::includes over two sets makes the code much shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:35:28 +03:00
Pavel Emelyanov	c8492b3562	database_test: Improve snapshot_skip_flush_works test It has two inaccuracies. First, when checking the contents of table directory, it uses pre-populated expected list with "manifest.json" in it. Weird. Second, when checking the contents of snapshot directory it doesn't check if the "schema.cql" is there. It's always there, but if something breaks in the future it may go unnoticed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:35:26 +03:00
Pavel Emelyanov	5a25d74b12	database_test: Simplify snapshot_works() tests No functional changes here, just make use of the new lister to shorten the code. A small side effect -- if the test fails because contents of directories changes, it will print the exact difference in logs, not just that N files are missing/present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:34:25 +03:00
Pavel Emelyanov	365044cdbb	database_test: Use collect_files() to remove files Some test cases remove files from table directory to perform some checks over the taken snapshots. Using collect_files() helper makes the code easier to read. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:34:24 +03:00
Pavel Emelyanov	e1f326d133	database_test: Use collect_files() to count files in directory Some test cases want to see that there is more than one file in a directory, so they can just re-use the new helper. Much shorter this way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:32:58 +03:00
Pavel Emelyanov	60d1f78239	database_test: Introduce collect_files() helper It returns a set of files in a given directory. Will be used by all next patches. Implemented using directory_lister, not lister::scan_dir in order to help removing the latter one in the future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:32:58 +03:00
Calle Wilund	6c6105e72e	test::pylib: Add azure (mock) testing to EAR matrix Fixes #26782 Adds a provider proxy for azure, using the existing mock server, now as a fixture.	2025-11-05 10:22:23 +00:00
Calle Wilund	b8a6b6dba9	test::boost::encryption_at_rest: Remove redundant azure test indent	2025-11-05 10:22:23 +00:00
Calle Wilund	10e591bd6b	test::boost::encryption_at_rest: Move azure tests to use fixture Fixes #26781 Makes the test independent of wrapping scripts. Note: retains the split into "real" and "mock" tests. For other tests, we either all mock, or allow the environment to select mock or real. Here we have them combined. More expensive, but otoh more thorough.	2025-11-05 10:22:22 +00:00
Calle Wilund	1d37873cba	test::lib: Add azure mock/real server fixture Wraps the real/mock azure server for test in a fixture. Note: retains the current test setup which explicitly runs some tests with "real" azure, if avail, and some always mock.	2025-11-05 10:22:22 +00:00
Calle Wilund	10041419dc	test::pylib::boost: Fix test gather to handle test suites Fixes #26775	2025-11-05 10:22:22 +00:00
Calle Wilund	565c701226	utils::gcp::object_storage: Fix typo in semaphore init Fixes #26776 Semaphore storage is ssize_t, not size_t.	2025-11-05 10:22:22 +00:00
Calle Wilund	2edf6cf325	test::boost::encryption_at_rest_test: Remove redundant indent Removed empty scope and reindents kms test using fixtures.	2025-11-05 10:22:22 +00:00
Calle Wilund	286a655bc0	test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test Fixes #26780 Uses fake/real CI endpoint for AWS KMS tests, and moves these into a suite for sharing the mock server.	2025-11-05 10:22:22 +00:00
Calle Wilund	a1cc866f35	test::boost::test_encryption_at_rest: Reorder tests and helpers No code changes. Just reorders code to organize more by provider etc, prepping for fixtures and test suites.	2025-11-05 10:22:22 +00:00
Calle Wilund	af85b7f61b	ent::encryption: Make text helper routines take std::string Moving away from custom string type. Pure cosmetics.	2025-11-05 10:22:22 +00:00
Calle Wilund	1b0394762e	test::pylib::dockerized_service: Handle docker/podman bind error message If we run non-dbuild, docker/podman can/will cause first bind error, we should check these too.	2025-11-05 10:22:22 +00:00
Calle Wilund	0842b2ae55	test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS Runs local-kms mock AWS KMS server unless overridden by env var. Allows tests to use real or fake AWS KMS endpoint and shared fixture for quicker execution.	2025-11-05 10:22:21 +00:00
Calle Wilund	98c060232e	test::lib::gcs_fixture: Only set port if running docker image + more retry Our connect can spuriously fail. Just retry.	2025-11-05 10:22:21 +00:00
Wojciech Mitros	977fa91e3d	view_building_coordinator: rollback tasks on the leaving tablet replica When a tablet migration is started, we abort the corresponding view building tasks (i.e. we change the state of those tasks to "ABORTED"). However, we don't change the host and shard of these tasks until the migration successfully completes. When for some reason we have to rollback the migration, that means the migration didn't finish and the aborted task still has the host and shard of the migration source. So when we recreate tasks that should no longer be aborted due to a rolled-back migration, we should look at the aborted tasks of the source (leaving) replica. But we don't do it and we look at the aborted tasks of the target replica. In this patch we adjust the rollback mechanism to recreate tasks for the migration source instead of destination. We also fix the test that should have detected this issue - the injection that the test was using didn't make us rollback, but we simply retried a stage of the tablet migration. By using one_shot=False and adding a second injection, we can now guarantee that the migration will eventually fail and we'll continue to the 'cleanup_target' and 'revert_migration' stages. Fixes https://github.com/scylladb/scylladb/issues/26691 Closes scylladb/scylladb#26825	2025-11-05 10:44:06 +01:00
Pavel Emelyanov	2cb98fd612	Merge 'api: storage_service: tasks: unify sync and async compaction APIs' from Aleksandra Martyniuk Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of synchronous and asynchronous cleanup, major compaction, and upgrade_sstables. Fixes: https://github.com/scylladb/scylladb/issues/26715. Requires backports to all live versions Closes scylladb/scylladb#26746 * github.com:scylladb/scylladb: api: storage_service: tasks: unify upgrade_sstable api: storage_service: tasks: force_keyspace_cleanup api: storage_service: tasks: unify force_keyspace_compaction	2025-11-05 10:47:14 +03:00
Pavel Emelyanov	59019bc9a9	Merge 'Alternator: allow warning on auth errors before enabling enforcement' from Nadav Har'El An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": After the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys. This series introduces a solution for users who want to safely switch `alternator_enforce_authorization` from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors. The first patch is the implementation of the feature - the new configuration option, the metrics and the log messages, the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternator_enforce_authorization` from false to true. Fixes #25308 This is a feature that users need, so it should probably be backported to live branches. Closes scylladb/scylladb#25457 * github.com:scylladb/scylladb: docs/alternator: explain alternator_warn_authorization test/alternator: tests for new auth failure metrics and log messages alternator: add alternator_warn_authorization config	2025-11-05 10:45:17 +03:00
Pavel Emelyanov	fc37518aff	test: Check file existence directly There's a test that checks if temporary-statistics file is gone at some point. It does it by listing the directory it expects the file to be in and then comparing the names met with the temp. stat. file name. It looks like a single file_exists() call is enough for that purpose. As a "sanity" check this patch adds a validation that non-temporary statistics file is there, all the more so this file is removed after the test. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26743	2025-11-04 19:37:55 +01:00
Avi Kivity	95700c5f7f	Merge 'Support counters with tablets' from Michael Litvak Support the counters feature in tablets keyspaces. The main change is to fix the counter update during tablets intranode migration. Counter cell is c = map<host_id, value>. A counter update is applied by doing read-modify-write on a leader replica to retrieve the current host's counter value and transform the mutation to contain the updated value for the host, then apply the mutation and replicate it to other hosts. the read-modify-write is protected against concurrent updates by locking the counter cell. When the counter is migrated between two shards, it's not enough to lock the counter on the read shard, because in the stage write_both_read_new the read shard is switched, and then we can have concurrent updates reach either the old or the new shard. In order to keep the counter update exclusive we lock both shards when in the stage write_both_read_new. Also, when applying the transformed mutation we need to respect write_both stages and apply the mutation on both shards. We change it to use `apply_on_shards` similarly to other methods in storage proxy. The change applies to both tablets and vnodes, they use the same implementation, but for vnodes the behavior should remain equivalent up to some small reordering of the code since it doesn't have intranode migration and reduces to single read shard = write shard. 
Fixes https://github.com/scylladb/scylladb/issues/18180 no backport - new feature Closes scylladb/scylladb#26636 * github.com:scylladb/scylladb: docs: counters now work with tablets pgo: enable counters with tablets test: enable counters tests with tablets test: add counters with tablets test cql3: remove warning when creating keyspace with tablets cql3: allow counters with tablets storage_proxy: lock all read shards for counter update storage_proxy: apply counter mutation on all write shards storage_proxy: move counter update coordination to storage proxy storage_proxy: refactor mutate_counter_on_leader replica/db: add counter update guard replica/db: split counter update helper functions	2025-11-03 22:28:10 +01:00
Raphael S. Carvalho	7f34366b9d	sstables_loader: Don't bypass synchronization with busy topology The patch `c543059f86` fixed the synchronization issue between tablet split and load-and-stream. The synchronization worked only with raft topology, and therefore was disabled with gossip. The check uses storage_service::raft_topology_change_enabled(), but the topology kind is only available/set on shard 0, so the synchronization was bypassed when load-and-stream ran on any shard other than 0. The reason the reproducer didn't catch it is that it was restricted to a single cpu. It will now run with multiple cpus and catch the problem observed. Fixes #22707 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#26730	2025-11-03 18:10:08 +01:00
Michael Litvak	8555fd42df	docs: counters now work with tablets Counters are now supported in tablet-enabled keyspaces, so remove the documentation that listed counters as an unsupported feature and the note warning users about the limitation.	2025-11-03 16:04:37 +01:00
Michael Litvak	1337f4213f	pgo: enable counters with tablets Now that counters are supported with tablets, update the keyspace statement for counters to allow it to run with tablets.	2025-11-03 16:04:37 +01:00
Michael Litvak	1dbf53ca29	test: enable counters tests with tablets Enable all counters-related tests that were disabled for tablets because counters was not supported with tablets until now. Some tests were parametrized to run with both vnodes and tablets, and the tablets case was skipped, in order to not lose coverage. We change them to run with the default configuration since now counters is supported with both vnodes and tablets, and the implementation is the same, so there is no benefit in running them with both configurations.	2025-11-03 16:04:37 +01:00
Michael Litvak	a6c12ed1ef	test: add counters with tablets test add a new test for counters with tablets to test things that are specific to tablets. test counter updates that are concurrent with tablet internode and intranode migrations and verify it remains consistent and no updates are lost.	2025-11-03 16:04:37 +01:00
Michael Litvak	60ac13d75d	cql3: remove warning when creating keyspace with tablets When creating a keyspace with tablets, a warning is shown with all the unsupported features for tablets, which is only counters currently. Now that counters is also supported with tablets, we can remove this warning entirely.	2025-11-03 16:04:37 +01:00
Michael Litvak	9208b2f317	cql3: allow counters with tablets Now that counters work with tablets, allow to create a table with counters in a tablets-enabled keyspace, and remove the warning about counters not being supported when creating a keyspace with tablets. We allow to use counters with tablets only when all nodes are upgraded and support counters with tablets. We add a new feature flag to determine if this is the case. Fixes scylladb/scylladb#18180	2025-11-03 16:04:37 +01:00
Michael Litvak	296b116ae2	storage_proxy: lock all read shards for counter update Previously in a counter update we lock the read shard to protect the counter's read-modify-write against concurrent updates. This is not sufficient when the counter is migrated between different shards, because there is a stage where the read shard switches from the old shard to the new shard, and during that switch there can be concurrent counter updates on both shards. If each shard takes only its own lock, the operations will not be exclusive anymore, and this can cause lost counter updates. To fix this, we acquire the counter lock on both shards in the stage write_both_read_new, when both shards can serve reads. This guarantees that counter updates continue to be exclusive during intranode migration.	2025-11-03 16:04:35 +01:00
Michael Litvak	de321218bc	storage_proxy: apply counter mutation on all write shards When applying a counter mutation, use apply_on_shards to apply the mutation on all write shards, similarly to the way other mutations are applied in the storage proxy. Previously the mutation was applied only on the current shard which is the read shard. This is needed to respect the write_both stages of intranode migration where we need to apply the mutation on both the old and the new shards.	2025-11-03 16:03:29 +01:00
Michael Litvak	c7e7a9e120	storage_proxy: move counter update coordination to storage proxy Refactor the counter update to split the functions and have them called by the storage proxy to prepare for a later change. Previously in mutate_counter the storage proxy calls the replica function apply_counter_update that does a few things: 1. checks that the operation can be done: check timeout, disk utilization 2. acquire counter locks 3. do read-modify-write and transform the counter mutation 4. apply the mutation in the replica In this commit we change it so that these functions are split and called from the storage proxy, so that we have better control from the storage proxy when we change it later to work across multiple shards. For example, we will want to acquire locks on multiple shards, transform it on one shard, and then apply the mutation on multiple shards. After the change it works as follows in storage proxy: 1. acquire counter locks 2. call replica prepare to check the operation and transform the mutation 3. call replica apply to apply the transformed mutation	2025-11-03 15:59:46 +01:00
Tomasz Grabiec	e878042987	Revert "Revert "tests(lwt): new test for LWT testing during tablet resize"" This reverts commit `6cb14c7793`. The issue causing the previous revert was fixed in `88765f627a`.	2025-11-03 10:38:00 +01:00
Michael Litvak	579031cfc8	storage_proxy: refactor mutate_counter_on_leader Slightly reorganize the mutate counter function to prepare it for a later change. Move the code that finds the read shard and invokes the rest of the function on the read shard to the caller function. This simplifies the function mutate_counter_on_leader_and_replicate which now runs on the read shard and will make it easier to extend.	2025-11-03 08:43:11 +01:00
Michael Litvak	7cc6b0d960	replica/db: add counter update guard Add a RAII guard for counter update that holds the counter locks and the table operation, and extract the creation of the guard to a separate function. This prepares it for a later change where we will want to obtain the guard externally from the storage proxy.	2025-11-03 08:43:11 +01:00
Michael Litvak	88fd9a34c4	replica/db: split counter update helper functions Split do_apply_counter_update to a few smaller and simpler functions to help prepare for a later change.	2025-11-03 08:43:11 +01:00
Avi Kivity	9b6ce030d0	sstables: remove quadratic (and possibly exponential) compile time in parse() parse() taking a list of elements is quadratic (during compile time) in that it generates recursive calls to itself, each time with one fewer parameter. The total size of the parameter lists in all these generated functions is quadratic in the initial parameter list size. It's also exponential if we ignore inlining limits, since each .then() call expands to two branches - a ready future branch and a non-ready future branch. If the compiler did not give up, we'd have 2^list_len branches. For sure the compiler does not do so indefinitely, but the effort getting there is wasted. Simplify by using a fold expression over the comma operator. Instead of passing the remaining parameter list in each step, we pass only the parameter we are processing now, making processing linear, and not generating unnecessary functions. It would be better expressed using pack expansion statements, but these are part of C++26. The largest offender is probably stats_metadata, with 21 elements. dev-mode sstables.o: text data bss dec hex filename 1760059 1312 7673 1769044 1afe54 sstables.o.before 1745533 1312 7673 1754518 1ac596 sstables.o.after We save about 15k of text with presumably a corresponding (small) decrease in compile time. Closes scylladb/scylladb#26735	2025-11-02 13:09:37 +01:00
Jenkins Promoter	cb30eb2e21	Update pgo profiles - aarch64	2025-11-01 05:23:52 +02:00
Jenkins Promoter	e3a0935482	Update pgo profiles - x86_64	2025-11-01 04:54:49 +02:00
Petr Gusev	88765f627a	paxos_state: get_replica_lock: remove shard check This check is incorrect: the current shard may be looking at the old version of tablets map: * an accept RPC comes to replica shard 0, which is already at write_both_read_new * the new shard is shard 1, so paxos_state::accept is called on shard 1 * shard 1 is still at "streaming" -> shards_ready_for_reads() returns old shard 0 Fixes scylladb/scylladb#26801 Closes scylladb/scylladb#26809	2025-10-31 21:37:39 +01:00
Avi Kivity	7a72155374	Merge 'Introduce nodetool excludenode' from Tomasz Grabiec If a node is dead and cannot be brought back, tablet migrations are stuck, until the node is explicitly marked as "permanently dead" / "ignored node" / "excluded" (name differs in different contexts). Currently, this is done during removenode and replace operations but it should be possible to only mark the node as dead, for the purpose of unblocking migrations or other topology operations, without doing the actual removenode, because full removal might be currently impossible, or not desirable due to lack of capacity or priorities. This patch introduces this kind of API: ``` nodetool excludenode <host-id> [ ... <host-id> ] ``` Having this kind of API is an improvement in user experience in several cases. For example, when we lose a rack, the only viable option for recovery is to run removenode with an extra --ignore-dead-nodes option. This removenode will fail in the tablet draining phase, as there is no live node in the rack to rebuild replicas in. This is confusing to the operator. But necessary before ALTER KEYSPACE can proceed in order to change replication options to drop the rack from RF. Having this API allows operators to have more unified procedures, where "nodetool excludenode" is always the first step of recovery, which unblocks further topology operations, both those which restore capacity, but also auto-scaling, tablet split/merge, load balancing, etc. Fixes #21281 The PR also changes "nodetool status" to show excluded nodes, they have 'X' in their status instead of 'D'. Closes scylladb/scylladb#26659 * github.com:scylladb/scylladb: nodetool: status: Show excluded nodes as having status 'X' test: py: Test scenario involving excludenode API nodetool: Introduce excludenode command	2025-10-31 22:14:57 +02:00
Avi Kivity	d458dd41c6	Merge 'Avoid input_/output_stream-s default initialization and move-assignment' from Pavel Emelyanov Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and then move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes a few places in Scylla that still use one. Adopting newer seastar API, no need to backport Closes scylladb/scylladb#26747 * github.com:scylladb/scylladb: commitlog: Remove unused work::r stream variable ec2_snitch: Fix indentation after previous patch ec2_snitch: Coroutinize the aws_api_call_once() sstable: Construct output_stream for data instantly test: Don't reuse on-stack input stream	2025-10-31 21:22:41 +02:00
Avi Kivity	adf9c426c2	Merge 'db/config: Change default SSTable compressor to LZ4WithDictsCompressor' from Nikos Dragazis `sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra). Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios. If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback. Fixes #26610. Closes scylladb/scylladb#26697 * github.com:scylladb/scylladb: test/cluster: Add test for default SSTable compressor db/config: Change default SSTable compressor to LZ4WithDictsCompressor db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl boost/cql_query_test: Get expected compressor from config	2025-10-31 21:15:18 +02:00
Lakshmi Narayanan Sreethar	3eb7193458	backlog_controller: compute backlog even when static shares are set The compaction manager backlog is exposed via metrics, but if static shares are set, the backlog is never calculated. As a result, there is no way to determine the backlog and if the static shares need adjustment. Fix that by calculating backlog even when static shares are set. Fixes #26287 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26778	2025-10-31 18:18:36 +02:00
Michał Hudobski	fd521cee6f	test: fix typo in vector_index test Unfortunately, in https://github.com/scylladb/scylladb/pull/26508 a typo that changes the behavior of a test was introduced. This patch fixes the typo. Closes scylladb/scylladb#26803	2025-10-31 18:35:02 +03:00
Tomasz Grabiec	284c73d466	scripts: pull_github_pr.sh: Fix auth problem detection Before the patch, the script printed: parse error: Invalid numeric literal at line 2, column 0 Closes scylladb/scylladb#26818	2025-10-31 18:32:58 +03:00
Michael Litvak	e7dbccd59e	cdc: use chunked_vector instead of vector for stream ids use utils::chunked_vector instead of std::vector to store cdc stream sets for tablets. a cdc stream set usually represents all streams for a specific table and timestamp, and has a stream id per each tablet of the table. each stream id is represented by 16 bytes. thus the vector could require quite large contiguous allocations for a table that has many tablets. change it to chunked_vector to avoid large contiguous allocations. Fixes scylladb/scylladb#26791 Closes scylladb/scylladb#26792	2025-10-31 13:02:34 +01:00
Tomasz Grabiec	1c0d847281	Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations. This is the second part of the size based load balancing changes: - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 This is a new feature and backport is not needed. Closes scylladb/scylladb#26152 * github.com:scylladb/scylladb: load_balancer: load_stats reconcile after tablet migration and table resize load_stats: change data structure which contains tablet sizes	2025-10-31 09:58:25 +01:00
Tomasz Grabiec	2bd173da97	nodetool: status: Show excluded nodes as having status 'X' Example: $ build/dev/scylla nodetool status Datacenter: dc1 =============== Status=Up/Down/eXcluded \|/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 127.0.0.1 783.42 KB 1 ? 753cb7b0-1b90-4614-ae17-2cfe470f5104 rack1 XN 127.0.0.2 785.10 KB 1 ? 92ccdd23-5526-4863-844a-5c8e8906fa55 rack2 UN 127.0.0.3 708.91 KB 1 ? 781646ad-c85b-4d77-b7e3-8d50c34f1f17 rack3	2025-10-31 09:03:20 +01:00
Tomasz Grabiec	87492d3073	test: py: Test scenario involving excludenode API	2025-10-31 09:03:20 +01:00
Tomasz Grabiec	55ecd92feb	nodetool: Introduce excludenode command If a node is dead and cannot be brought back, tablet migrations are stuck, until the node is explicitly marked as "permanently dead" / "ignored node" / "excluded" (name differs in different contexts). Currently, this is done during removenode and replace operations but it should be possible to only mark the node as dead, for the purpose of unblocking migrations or other topology operations, without doing the actual removenode, because full removal might be currently impossible, or not desirable due to lack of capacity or priorities. This patch introduces this kind of API: nodetool excludenode <host-id> [ ... <host-id> ] Having this kind of API is an improvement in user experience in several cases. For example, when we lose a rack, the only viable option for recovery is to run removenode with an extra --ignore-dead-nodes option. This removenode will fail in the tablet draining phase, as there is no live node in the rack to rebuild replicas in. This is confusing to the operator. But necessary before ALTER KEYSPACE can proceed in order to change replication options to drop the rack from RF. Having this API allows operators to have more unified procedures, where "nodetool excludenode" is always the first step of recovery, which unblocks further topology operations, both those which restore capacity, but also auto-scaling, tablet split/merge, load balancing, etc. Fixes #21281	2025-10-31 09:03:20 +01:00
Avi Kivity	04a289cae6	Merge 'Auto expand to rack list' from Tomasz Grabiec We want to move towards rack-list based replication factor for tablets being the default mode, and in the future the only supported mode. This PR is a step towards that. We auto-expand numeric RF to rack list on keyspace creation and ALTER when rf_rack_valid_keyspaces option is enabled. The PR is mostly about adjusting tests. The main logic change is in the last patch, which modifies option post-processing in ks_prop_defs. Fixes #26397 Closes scylladb/scylladb#26692 * github.com:scylladb/scylladb: cql3: ks_prop_defs: Expand numeric RF to rack list locator: Move rack_list to topology.hh alternator: Do not set RF for zero-token DCs alternator: Switch keyspace creation to use ks_prop_defs test: alternator: Adjust for rack lists cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs test: cqlpy: Mark tests using rack lists as scylla-only test: Switch to rack-list based RF test: Generalize tests to work with both numeric RF and rack lists test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF test: Prepare for handling errors specific to rack list path test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention test: Create cluster with multiple racks in multi-dc setups test: boost: network_topology_strategy_test: Adjust to rack-list RF test: tablets: Adjust to rack list test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid test: object_store: test_backup: Adjust for rack lists test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity test: cluster: mv: Do not move tablets across racks test: cluster: util: Fix docstring for parse_replication_options() tablets, topology_coordinator: Skip tablet draining on replace	2025-10-30 21:54:08 +02:00
Avi Kivity	c0222e4d3c	Merge 'replica/table: do not stop major compaction when disabling auto compaction' from Lakshmi Narayanan Sreethar When auto compaction is disabled, all ongoing compactions, including major compactions, are stopped. However, major compactions should not be stopped, since the disable request applies only to regular auto compactions. This PR fixes the issue by tagging major compaction tasks with a newly introduced `compaction_type::Major` enum. Since `table::disable_auto_compaction()` already requests the compaction manager to stop only tasks of type `compaction_type::Compaction`, major compactions will no longer be stopped. Fixes #24501 PR improves how the compactions are stopped when a disable auto compaction request is executed. No need to backport Closes scylladb/scylladb#26288 * github.com:scylladb/scylladb: replica/table: do not stop major compaction when disabling auto compaction compaction/compaction_descriptor: introduce compaction_type::Major	2025-10-30 21:45:57 +02:00
Nikos Dragazis	a0bf932caa	test/cluster: Add test for default SSTable compressor The previous patch made the default compressor dependent on the SSTABLE_COMPRESSION_DICTS feature: * LZ4Compressor if the feature is disabled * LZ4WithDictsCompressor if the feature is enabled Add a test to verify that the cluster uses the right default in every case. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-30 15:53:54 +02:00
Nikos Dragazis	2fc812a1b9	db/config: Change default SSTable compressor to LZ4WithDictsCompressor `sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is `LZ4Compressor` (inherited from Cassandra). Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios. If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-30 15:53:49 +02:00
Pavel Emelyanov	395e275e03	Merge 'test/cluster/random_failures: Adjust to RF-rack-validity' from Dawid Mędrek We adjust the test to RF-rack-validity and then re-enable index random events, which requires the configuration option `rf_rack_valid_keyspaces` to be enabled. Fixes scylladb/scylladb#26422 Backport: I'd rather not backport these changes. They're almost a hack and pose too much risk for little gain. Closes scylladb/scylladb#26591 * github.com:scylladb/scylladb: test/cluster/random_failures: Re-enable index events test/cluster/random_failures: Enable rf_rack_valid_keyspaces test/cluster/random_failures: Adjust to RF-rack-validity	2025-10-30 15:39:38 +03:00
Aleksandra Martyniuk	fdd623e6bc	api: storage_service: tasks: unify upgrade_sstable Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace} and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.	2025-10-30 11:42:48 +01:00
Aleksandra Martyniuk	044b001bb4	api: storage_service: tasks: force_keyspace_cleanup Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of /storage_service/keyspace_cleanup/{keyspace} and /tasks/compaction/keyspace_cleanup/{keyspace}.	2025-10-30 11:42:47 +01:00
Aleksandra Martyniuk	12dabdec66	api: storage_service: tasks: unify force_keyspace_compaction Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace}, to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}). Unify the handlers of both apis.	2025-10-30 11:33:17 +01:00
Tomasz Grabiec	6cb14c7793	Revert "tests(lwt): new test for LWT testing during tablet resize" This reverts commit `99dc31e71a`. The test is not stable due to #26801	2025-10-30 08:50:40 +01:00
Piotr Wieczorek	0398bc0056	test/alternator: Enable the tests failing because of #6918 The tests pass only with alternator_streams_strict_compatibility flag enabled, because of a suspected non-negligible performance impact (i.e. an additional entire-item comparison and type conversions). Refs https://github.com/scylladb/scylladb/issues/6918	2025-10-30 08:38:31 +01:00
Piotr Wieczorek	66ac66178b	alternator, cdc: Don't emit events for no-op removes Deletes that don't change the state of the database visible to the user (e.g. an attempt to delete a missing item) shouldn't produce a cdc log. This commit addresses this DynamoDB compatibility issue if the delete is a partition delete, a row delete, or a cell delete. It works under the assumption that the change was produced by Alternator. This means that it doesn't support range deletes, static row deletes, deletes of collection cells other than a map, etc. See also its parent commit, which introduces the methods that this commit extends. This commit handles the following cases: - `DeleteItem of nonexistent item: nothing`, - `BatchWriteItem.DeleteItem of nonexistent item: nothing`. Refs https://github.com/scylladb/scylladb/pull/26121	2025-10-30 08:38:30 +01:00
Piotr Wieczorek	a32e8091a9	alternator, cdc: Don't emit an event for equal items This commit adds a function that compares split mutations with the `row_state` that was selected as a preimage or propagated through cdc options by a caller. If the items are equal, the corresponding log row isn't generated. As a result, creating an item with BatchWriteItem, PutItem, or UpdateItem doesn't emit an INSERT/MODIFY event if an exactly identical item already exists. Comparing the items may be costly, so this logic is controlled by the `alternator_streams_strict_compatibility` flag. This commit handles the following cases: - `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and equal item: nothing`	2025-10-30 08:38:30 +01:00
Piotr Wieczorek	8c2f60f111	alternator/streams, cdc: Differentiate item replace and item update in CDC This commit improves compatibility with DynamoDB streams by changing the emitted events when creating/updating an item. Replace/update operations of an existing item emit a MODIFY, whereas replacing/updating a missing item results in an INSERT. If the state of the item doesn't change after applying the operation, no event is emitted. This commit handles the following cases: - `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and not equal item: MODIFY` - `PutItem/UpdateItem/BatchWriteItem.PutItem of a nonexistent item: INSERT` Refs https://github.com/scylladb/scylladb/issues/6918	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	4f6aeb7b6b	alternator: Change the return type of rmw_operation_return Change the type from future<executor::request_return_type> to executor::request_return_type, because the method isn't async and one out of two callers unwraps the future immediately. This simplifies the code a little and probably saves a few instructions, since we suspect that moving a future<X> is more expensive than just moving X.	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	ffdc8d49c7	config: Add alternator_streams_strict_compatibility flag With this flag enabled, Alternator Streams produces more accurate event types: - nop operations (i.e. replacing an item with an identical one, deleting a nonexistent item) don't produce an event, - updates of an existing item produce a MODIFY event, instead of INSERT, - etc. This flag affects the internal behaviour of some operations, i.e. Alternator may select a preimage and propagate it to CDC (as opposed to CDC making the request), or do extra item comparisons (i.e. compare the existing item with the new one). These operations may be costly, and users that don't use Streams won't need them. This flag is live-updatable. An operation reads this flag once, and uses its value for the entire operation.	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	e3fde8087a	cdc: Don't split a row marker away from row cells CDC log table records a mutation as a sequence of log rows that record an atomic change (i.e. a row marker, tombstones, etc.), whereas a mutation in Alternator Streams always appears as a single log row. The type of operation is determined based on the type of the last log row in CDC. As a result, updates that create a row always appeared to Alternator Streams as an update (row marker + data), rather than an insert. This commit makes them a single log row. Its operation type is insert if it contains a row marker, and an update otherwise, which gives results consistent with DynamoDB Streams.	2025-10-30 07:40:31 +01:00
Tomasz Grabiec	28f6bdc99b	cql3: ks_prop_defs: Expand numeric RF to rack list Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for new DCs specified in the statement. Doesn't auto-expand existing options, as the rack choice may not be in line with current replica placement. This requires co-locating tablet replicas, and tracking of co-location state, which is not implemented yet. Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-29 23:32:59 +01:00
Tomasz Grabiec	35166809cb	locator: Move rack_list to topology.hh So that we can use it in locator/tablets.hh and avoid circular dependency between that header and abstract_replication_strategy.hh	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	f6dfea2fb1	alternator: Do not set RF for zero-token DCs That will fail with tablets because it won't be able to allocate replicas.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	21db21af7e	alternator: Switch keyspace creation to use ks_prop_defs So that we get the same validation and option post-processing as during regular keyspace creation. RF auto-expansion logic happens in ks_prop_defs, and we want that for tablets.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	7f66f67d95	test: alternator: Adjust for rack lists To achieve RF=3 with tablets and rf_rack_valid_keyspaces, we need 3 racks. So change the test to create 3 racks. Alternator was bypassing standard keyspace creation path, so it escaped validation. But this will change, and the test will stop working. Also, after auto-expansion of RF to rack list, not all of 4 nodes will host replicas. So need to adjust expectations.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	a88f70ce2c	cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs Tests expect this failure in some scenarios, but later changes make us fail earlier due to topology constraints. As a rule, more general validation should come before more specific validation. So syntax validation before topology validation.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	8e69c65124	test: cqlpy: Mark tests using rack lists as scylla-only Those tests are intended to be also run against Cassandra, which doesn't support rack lists.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	ba53f41f59	test: Switch to rack-list based RF Have to do that before we enable auto-expansion of numeric RF to rack-lists, because those tests alter the replication factor, and altering from rack-list to numeric will not be allowed.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	d2e7d6fad2	test: Generalize tests to work with both numeric RF and rack lists	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	aa05f0fad0	test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF Two changes here: 1) Allocate nodes in dc2 in separate racks to make the test stronger - it invites bugs where RF==nr_racks succeeds despite there being zero-token nodes, and not simply fail due to rack count. 2) Due to auto-expansion to rack list, scylla throws in keyspace creation rather than table creation.	2025-10-29 23:32:58 +01:00
Benny Halevy	e8b9f13061	test: Prepare for handling errors specific to rack list path	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	255f429a80	test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention With rf_rack_valid_keyspaces enabled, RF of alternator tables will be equal to the number of racks (in this test: nodes). Prior to that, if number of nodes is smaller than 3, alternator creates the keyspace with RF=1. Turns out, with RF=2 the test fails with write timeouts due to contention. Enforce RF=1 by creating the table with one node before adding the second node.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	40e7543361	test: Create cluster with multiple racks in multi-dc setups To allow auto-expansion of numeric RF to rack list. Otherwise, keyspace creation will be rejected if rf-rack-valid keyspaces are enforced.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	723622cf70	test: boost: network_topology_strategy_test: Adjust to rack-list RF	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	19d0beff38	test: tablets: Adjust to rack list test_decommission_rack_load_failure expects some tablets to land in the rack which only has the decommissioning node. Since the table uses RF=1, auto-expansion may choose the other rack and put all tablets there, and the expected failure will not happen. Force placement by using rack-list RF.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	7ccc2a3560	test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	0f38f7185c	test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	5962498983	test: object_store: test_backup: Adjust for rack lists With rack lists, not all nodes in a rack will receive streams if RF=1. Adjust expectations.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	3b8a3823db	test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity Choose old_replica and new_replica so that they're both in rack r1. After later changes (rack list auto expansion), it's no longer guaranteed that the first replica will be on r1.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	5bf7112fe6	test: cluster: mv: Do not move tablets across racks It's illegal with rf-rack-valid keyspaces.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	e34548ccdb	test: cluster: util: Fix docstring for parse_replication_options() rack lists are now in replication_v2, which is also parsed with this function.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	288e75fe22	tablets, topology_coordinator: Skip tablet draining on replace Replace doesn't drain (rebuild) tablets during topology change. They are rebuilt afterwards when the replaced node is in "left" state and replacing node is in normal state. So there is no point in attempting to drain, as nothing will be drained. Not only that, doing so has a risk, because the load balancer is invoked on a transitional topology state in which we can end up with no normal nodes in a rack. That's the case if the replaced node was the last one in the rack. This tripped one of the algorithms which computes rack's shard count for the purpose of determining ideal tablet count, it was not prepared to find an empty rack to which a table is still replicated. That was fixed separately, but to avoid this, we better skip tablet draining here.	2025-10-29 23:32:57 +01:00
Taras Veretilnyk	c922256616	sstables: add overload of data_stream() to accept custom file_input_stream_options This patch introduces a new overload of 'sstable::data_stream()' that allows callers to provide their own 'file_input_stream_options'. This change will be useful in the next commit to enable integrity checking for file streaming.	2025-10-29 22:30:18 +01:00
Nikos Dragazis	96e727d7b9	db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl The option is a knob that allows to reject dictionary-aware compressors in the validation stage of CREATE/ALTER statements, and in the validation of `sstable_compression_user_table_options`. It was introduced in `7d26d3c7cb` to allow the admins of Scylla Cloud to selectively enable it in certain clusters. For more details, check: https://github.com/scylladb/scylla-enterprise/issues/5435 As of this series, we want to start offering dictionary compression as the default option in all clusters, i.e., treat it as a generally available feature. This makes the knob redundant. Additionally, making dictionary compression the default choice in `sstable_compression_user_table_options` creates an awkward dependency with the knob (disabling the knob should cause `sstable_compression_user_table_options` to fall back to a non-dict compressor as default). That may not be very clear to the end user. For these reasons, mark the option as "Deprecated", remove all relevant tests, and adjust the business logic as if dictionary compression is always available. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-29 20:13:08 +02:00
Nadav Har'El	aa34f0b875	alternator: fix CDC events for TTL expiration In commit `a3ec6c7d1d` we supposedly implemented the feature of telling TTL expiration events from regular user-sent deletions. However, that implementation did not actually work at all... It had two bugs: 1. It created a null rjson::value() instead of an empty dictionary rjson::empty_object(), so GetRecords failed every time such a TTL expiration event was generated. 2. In the output, it used lowercase field names "type" and "principalId" instead of the uppercase "Type" and "PrincipalId". This is not the correct capitalization, and when boto3 receives such incorrect fields it silently deletes them and never passes them to the user's get_records() call. This patch fixes those two bugs, and importantly - enables a test for this feature. We did already have such a test but it was marked as "veryslow" so doesn't run in CI and apparently not even run once to check the new feature. This test is not actually very long on Alternator when the TTL period is set very low (as we do in our tests), so I replaced the "veryslow" marker by "waits_for_expiration". The latter marker means that the test is still very slow - as much as half an hour - on DynamoDB - but runs quickly on Scylla in our test setup, and enabled in CI by default. The enabled test failed badly before this patch (a server error during GetRecords), and passes with this patch. Also, the aforementioned commit forgot to remove the paragraph in Alternator's compatibility.md that claims we don't have that feature yet. So we do it now. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26633	2025-10-29 17:08:20 +01:00
Piotr Wieczorek	2812e67f47	cdc: Emit a preimage for non-clustered tables Until this patch, CDC didn't fetch a preimage for mutations containing only a partition tombstone. Therefore, single-row deletions in a table without a clustering key didn't include a preimage, which was inconsistent with single-row clustered deletions. This commit addresses this inconsistency. The second reason is compatibility with DynamoDB Streams, which doesn't support entire-partition deletes. Alternator uses partition tombstones for single-row deletions, though, and in these cases the 'OldImage' was missing from REMOVE records. Fixes https://github.com/scylladb/scylladb/issues/26382 Closes scylladb/scylladb#26578	2025-10-29 17:54:58 +02:00
Nadav Har'El	29ed1f3de7	Merge 'cql3: Refactor vector search select impl into a dedicated class' from Karol Nowacki cql3: Refactor vector search select impl into a dedicated class The motivation for this change is crash fixed in https://github.com/scylladb/scylladb/pull/25500. This commit refactors how ANN ordered select statements are handled to prevent a potential null pointer dereference and improve code organization. Previously, vector search selects were managed by `indexed_table_select_statement`, which unconditionally dereferenced a `view_ptr`. This assumption is invalid for vector search indexes where no view exists, creating a risk of crashes. To address this, the refactoring introduces the following changes: - A new `vector_indexed_table_select_statement` class is created to specifically handle ANN-ordered selects. This class operates without a view_ptr, resolving the null pointer risk. - The `indexed_table_select_statement` is renamed to `view_indexed_table_select_statement` to more accurately reflect its function with view-based indexes. - An assertion has been added to `indexed_table_select_statement` constructor to ensure view_ptr is not null, preventing similar issues in the future. Fixes: VECTOR-162 No backport is needed, as this is refactoring. Closes scylladb/scylladb#25798 * github.com:scylladb/scylladb: cql3: Rename indexed_table_select_statement cql3: Move vector search select to dedicated class	2025-10-29 17:49:24 +02:00
Lakshmi Narayanan Sreethar	7eac18229c	replica/table: do not stop major compaction when disabling auto compaction When auto compaction is disabled, all ongoing compactions, including major compactions, are stopped. However, major compactions should not be stopped, since the disable request applies only to regular auto compactions. This patch fixes the issue by tagging major compaction tasks with the newly introduced `compaction_type::Major`. Since `table::disable_auto_compaction()` already requests the compaction manager to stop only tasks of type `compaction_type::Compaction`, major compactions will no longer be stopped. Fixes #24501 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-29 19:22:07 +05:30
Lakshmi Narayanan Sreethar	4d442f48db	compaction/compaction_descriptor: introduce compaction_type::Major Introduce a new compaction_type enum : `Major`. This type will be used by the next patches to differentiate between major compaction and regular compaction (compaction_type::Compaction). Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-29 19:21:53 +05:30
Nikos Dragazis	d95ebe7058	boost/cql_query_test: Get expected compressor from config Since `5b6570be52`, the default SSTable compression algorithm for user tables is no longer hardcoded; it can be configured via the `sstable_compression_user_table_options.sstable_compression` option in scylla.yaml. Modify the `test_table_compression` test to get the expected value from the configuration. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-29 14:52:43 +02:00
Piotr Dulikowski	aba922ea65	Merge 'cdc: improve cdc metadata loading' from Michael Litvak when loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. the CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. on tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and append them to the existing map. we can do this because we only append new entries to cdc_streams_history with higher timestamp than all previous entries. This makes this reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablets splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes https://github.com/scylladb/scylladb/issues/26732 backport to 2025.4 where cdc with tablets is introduced Closes scylladb/scylladb#26160 * github.com:scylladb/scylladb: test: cdc: extend cdc with tablets tests cdc: improve cdc metadata loading	2025-10-29 11:07:48 +01:00
Michał Hudobski	46589bc64c	secondary_index: disallow multiple vector indexes on the same column We currently allow creating multiple vector indexes on one column. This doesn't make much sense as we do not support picking one when making ann queries. To make this less confusing and to make our behavior similar to Cassandra we disallow the creation of multiple vector indexes on one column. We also add a test that checks this behavior. Fixes: VECTOR-254 Fixes: #26672 Closes scylladb/scylladb#26508	2025-10-29 11:55:38 +02:00
Patryk Jędrzejczak	7304afb75a	Merge 'vnodes cleanup: renames and code comments fixes' from Petr Gusev This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. Fixes several review comments that were left unresolved in the original PR. backport: not needed, this PR contains only renames and code comment fixes Closes scylladb/scylladb#26745 * https://github.com/scylladb/scylladb: test_automatic_cleanup: fix comment storage_proxy: remove stale comment storage_proxy: improve run_fenceable_write comment topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup	2025-10-29 10:38:27 +01:00
Nadav Har'El	492c664fbb	docs/alternator: explain alternator_warn_authorization The previous patches added the ability to set alternator_warn_authorization. In this patch we add to our documentation a recommendation that this setting be used as an intermediate step when wanting to change alternator_enforce_authorization from "false" to "true". We explain why this is useful and important. The new documentation is in docs/alternator/compatibility.md, where we previously explained the alternator_enforce_authorization configuration. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-29 11:16:29 +02:00
Nadav Har'El	2dbd1a85a3	test/alternator: tests for new auth failure metrics and log messages This patch adds to test_metrics.py tests that authentication and authorization errors increment, respectively, the new metrics scylla_alternator_authentication_failures scylla_alternator_authorization_failures This patch also adds in test_logs.py tests that verify that log messages are generated on different types of authentication/authorization failures. The tests also check how configuring alternator_enforce_authorization and alternator_warn_authorization changes these behaviors: * alternator_enforce_authorization determines whether an auth error will cause the request to fail, or the failure is counted but then ignored. * alternator_warn_authorization determines whether an auth error will cause a WARN-level log message to be generated (and also the failure is counted). * If both configuration flags are false, Alternator doesn't even attempt to check authentication or authorization - so errors aren't even counted. Because the new tests live-update the alternator_*_authorization configuration options, they also serve as a test that live-updating this option works correctly. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-29 11:16:29 +02:00
Nadav Har'El	51186b2f2c	alternator: add alternator_warn_authorization config Before this patch, the configuration alternator_enforce_authorization is a boolean: true means enforce authentication checks (i.e., each request is signed by a valid user) and authorization checks (the user who signed the request is allowed by RBAC to perform this request). This patch adds a second boolean configuration option, alternator_warn_authorization. When alternator_enforce_authorization is false but alternator_warn_authorization is true, authentication and authorization checks are performed as in enforce mode, but failures are ignored and counted in two new metrics: scylla_alternator_authentication_failures scylla_alternator_authorization_failures Additionally, each authentication or authorization error is logged as a WARN-level log message. Some users prefer those log messages over metrics, as the log messages contain additional information about the failure that can be useful - such as the address of the misconfigured client, or the username attempted in the request. All combinations of the two configuration options are allowed: * If just "enforce" is true, auth failures cause a request failure. The failures are counted, but not logged. * If both "enforce" and "warn" are true, auth failures cause a request failure. The failures are both counted and logged. * If just "warn" is true, auth failures are ignored (the request is allowed to complete) but are counted and logged. * If neither "enforce" nor "warn" is true, no authentication or authorization checks are done at all. So we don't know about failures, so naturally we don't count them and don't log them. This patch is fairly straightforward, doing mainly the following things: 1. Add an alternator_warn_authorization config parameter. 2. Make sure alternator_enforce_authorization is live-updatable (we'll use this in a test in the next patch). It "almost" was, but a typo prevented the live update from working properly. 3. Add the two new metrics, and increment them in every type of authentication or authorization error. Some code that needs to increment these new metrics didn't have access to the "stats" object, so we had to pass it around more. 4. Add log messages when alternator_warn_authorization is true. 5. If alternator_enforce_authorization is false, allow the auth check to allow the request to proceed (after having counted and/or logged the auth error). A separate patch will follow and add documentation suggesting to users how to use the new "warn" options to safely switch between non-enforcing to enforcing mode. Another patch will add tests for the new configuration options, new metrics and new log messages. Fixes #25308. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-29 11:16:26 +02:00
Dawid Mędrek	48cbf6b37a	test/cluster/test_tablets: Migrate dtest We migrate `tablets_test.py::TestTablets::test_moving_tablets_replica_on_node` from dtests to the repository of Scylla. We divide the test into two steps to make testing easier and even possible with RF-rack-valid keyspaces being enforced. Closes scylladb/scylladb#26285	2025-10-29 11:09:48 +02:00
Karol Nowacki	9f1fd7f5a0	cql3: Rename indexed_table_select_statement To align with `vector_indexed_table_select_statement`, this commit renames `indexed_table_select_statement` to `view_indexed_table_select_statement` to clarify its usage with materialized views.	2025-10-29 08:37:25 +01:00
Karol Nowacki	357c0a8218	cql3: Move vector search select to dedicated class The execution of SELECT statements with ANN ordering (vector search) was previously implemented within `indexed_table_select_statement`. This was not ideal, as vector search logic is independent of secondary index selects. This resulted in unnecessary complexity because vector search queries don't use features like aggregates or paging. More importantly, `indexed_table_select_statement` assumed a non-null `view_schema` pointer, which doesn't hold for vector indexes (where `view_ptr` is null). This caused null pointer dereferences during ANN ordered selects, leading to crashes (VECTOR-179). Other parts of the class still dereference `view_schema` without null checks. Moving the vector search select logic out of `indexed_table_select_statement` simplifies the code and prevents these null pointer dereferences.	2025-10-29 08:37:21 +01:00
Taras Veretilnyk	e62ebdb967	table: enable integrity checks for streaming reader Previously, streaming readers only verified the checksum of compressed SSTables. This patch extends checks to also include the digest and the uncompressed checksum (CRC). These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest check is skipped.	2025-10-28 19:27:35 +01:00
Taras Veretilnyk	06e1b47ec6	table: Add integrity option to table::make_sstable_reader()	2025-10-28 19:27:35 +01:00
Taras Veretilnyk	deb8e32e86	sstables: Add integrity option to create_single_key_sstable_reader Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations. This allows callers to enable SSTable integrity checks during single-key reads.	2025-10-28 19:27:35 +01:00
Petr Gusev	b6bcd062de	test_automatic_cleanup: fix comment	2025-10-28 17:55:20 +01:00
Petr Gusev	d49be677d5	storage_proxy: remove stale comment	2025-10-28 17:55:20 +01:00
Petr Gusev	c60223f009	storage_proxy: improve run_fenceable_write comment	2025-10-28 17:55:20 +01:00
Petr Gusev	58d100a0cb	topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes	2025-10-28 17:55:20 +01:00
Petr Gusev	fa9dc71f30	storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed	2025-10-28 17:55:19 +01:00
Pavel Emelyanov	e99c8eee08	commitlog: Remove unused work::r stream variable Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:46:29 +03:00
Dawid Mędrek	5e03b01107	test/cluster: Add test_simulate_upgrade_legacy_to_raft_listener_registration We provide a reproducer test of the bug described in scylladb/scylladb#18049. It should fail before the fix introduced in scylladb/scylladb@7ea6e1ec0a, and it should succeed after it. Refs scylladb/scylladb#18049 Fixes scylladb/scylladb#18071 Closes scylladb/scylladb#26621	2025-10-28 17:32:15 +01:00
Pavel Emelyanov	92462e502f	ec2_snitch: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:31:08 +03:00
Pavel Emelyanov	7640ade04d	ec2_snitch: Coroutinize the aws_api_call_once() The method connects a socket, grabs in/out streams from it then writes HTTP request and reads+parses the response. For that it uses class variables for socket and streams, but there's no real need for that -- all three actually exist throughout the method "lifetime". To fix it, coroutinize the method. The same could be achieved by moving the connected socket and streams into do_with() context, but coroutine is better than that. (indentation is left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:29:25 +03:00
Pavel Emelyanov	5d89816fed	sstable: Construct output_stream for data instantly This change makes the local output_stream variable be constructed in the declaration statement with the help of the ternary operator, thus avoiding both default-initialization and move-assignment depending on the standalone condition checking. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:27:22 +03:00
Pavel Emelyanov	37b9cccc1c	test: Don't reuse on-stack input stream The test consists of several snippets, each creating an input_stream for some short operation and checking the result. Each snippet overwrites the local `input_stream in` variable with the new one. This change wraps each of those snippets into its own code block in order to have its own new `input_stream in` variable in each. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:25:07 +03:00
Yauheni Khatsianevich	99dc31e71a	tests(lwt): new test for LWT testing during tablet resize – Workload: N workers perform CAS updates UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read. - Enable balancing and increase min_tablet_count to force split, flush and lower min_tablet_count to merge. - “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a LOCAL_SERIAL read to determine whether the CAS actually applied. - Invariants: after the run, for every pk and column s{i}, the stored value equals the number of confirmed CAS by worker i (no lost or phantom updates) despite ongoing tablet moves. Closes scylladb/scylladb#26113	2025-10-28 16:48:57 +01:00
Petr Gusev	d300adc10c	storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup This cleanup is only for vnodes-based tables, reflect this in the function name.	2025-10-28 15:37:28 +01:00
Michael Litvak	4cc0a80b79	test: cdc: extend cdc with tablets tests extend and improve the tests of virtual tables for cdc with tablets. split the existing virtual tables test to one test that validates the virtual tables against the internal cdc tables, and triggering some tablet splits in order to create entries in the cdc_streams_history table, and add another test with basic validation of the virtual tables when there are multiple cdc tables.	2025-10-28 15:06:21 +01:00
Pavel Emelyanov	ae0136792b	utils: Make directory_lister use generator lister from seastar The directory_lister uses utils::lister under the hood which accepts a callback to put directory_entry-s in. The directory_lister's callback then puts the entries into a queue and its .get() method pops up entries from there to return to caller. This patch simplifies this code by switching the directory_lister to use experimental generator lister from seastar. With it, the entries to be returned from .get() are simply co_await-ed from calling the generator object (which co_yield-s them). As a result the directory_lister becomes smaller and drops the need for utils::lister. Since directory_lister was created as a replacement for that callback-based lister, the latter can be eventually removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26586	2025-10-28 15:20:20 +02:00
Pavel Emelyanov	948cefa5f9	test: Extend API consistency test with tokens_endpoint endpoint Recently (#26231) a test was added to check that several API endpoints, that return tokens and corresponding replica nodes, are consistent with tablet map. This patch adds one more API endpoint to the validation -- the /storage_service/tokens_endpoint one. The extension is pretty straightforward, but the new endpoint returns back a single (primary) replica for a token, so the test check is slightly modified to account for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26580	2025-10-28 15:18:09 +02:00
Dawid Mędrek	535d31b588	test/cluster/random_failures: Re-enable index events We've enabled the configuration option `rf_rack_valid_keyspaces`, so we can finally re-enable the events creating and dropping secondary indexes. Fixes scylladb/scylladb#26422	2025-10-28 14:17:14 +01:00
Dawid Mędrek	b4898e50bf	test/cluster/random_failures: Enable rf_rack_valid_keyspaces Now that the test has been adjusted to work with the configuration option, we enable it.	2025-10-28 14:17:09 +01:00
Dawid Mędrek	59b2a41c49	test/cluster/random_failures: Adjust to RF-rack-validity We adjust the test to work with the configuration option `rf_rack_valid_keyspaces` enabled. For that, we ensure that there is always at least one node in each of the three racks. This way, all keyspaces we create and manipulate will remain RF-rack-valid since they all use RF=3. ------------------------------------------------------------------------ To achieve that, we only need to adjust the following events: 1. `init_tablet_transfer` The event creates a new keyspace and table and manually migrates a tablet belonging to it. As long as we make sure the migration occurs within the same rack, there will be no problem. Since RF == #racks, each rack will have exactly one tablet replica, so we can migrate the tablet to an arbitrary node in the same rack. Note that there must exist a node that's not a replica. If there weren't such a node, the test wouldn't have worked before this commit because it's not possible to migrate a tablet from one node being its replica to another. In other words, we have a guarantee that there are at least 4 nodes in the cluster when we try to migrate a tablet replica. That said, we check it anyway. If there's no viable node to migrate the tablet replica to, we log that information and do nothing. That should be an acceptable solution. 2. `add_new_node` As long as we add a node to an existing rack, there's no way to violate the invariant imposed by the configuration option, so we pick a random rack out of the existing three and create a node in it. 3. `decommission_node` We need to ensure that the node we'll be trying to decommission is not the only one in its rack. Following pretty much the same reasoning as in `init_tablet_transfer`, we conclude there must be a rack with at least two nodes in it. Otherwise we'd end up having to migrate a tablet from one replica node to another, which is not possible. 
What's more, decommissioning a node is not possible if any node in the cluster is dead, so we can assume that `manager.running_servers` returns the whole cluster. 4. `remove_node` The same as `decommission_node`. Note that although the node we choose to remove must first be stopped, no other node can be dead, so the whole cluster must be returned by `manager.running_servers`. ------------------------------------------------------------------------ There's one more important thing to note. The test may sometimes trigger a sequence of events where a new node is started, but, due to an error injection, its initialization is not completed. Among other things, the node may NOT have a host ID recognized by the rest of the nodes in the cluster, and operations like tablet migration will fail if they target it. Thankfully, there seems to be a way to avoid problems stemming from that. When a new node is added to the cluster, it should appear at the end of the list returned by `manager.running_servers`. This most likely stems from how dictionaries work in Python: "Keys and values are iterated over in insertion order." -- https://docs.python.org/3/library/stdtypes.html#dict-views and the fact that we keep track of running servers using a dictionary. Furthermore, we rely on the assumption that the test currently works correctly. Assume, to the contrary, that among the nodes taking part in the operations listed above, there is at most one node per rack that has its host ID recognized by the rest of the cluster. Note that only those nodes can store any tablets. Let's refer to the set of those nodes as X. Assume that we're dealing with tablet migration, decommissioning, or removing a node. Since those operations involve tablet migration, at least one tablet will need to be migrated from the node in question to another node in X. 
However, since X consists of at most three nodes, and one of them is losing its tablet, there is no viable target for the tablet, so the operation fails. Using those assumptions, an auxiliary function, `select_viable_rack`, was designed to carefully choose a correct rack, which we'll then pick nodes from to perform the topological operations. It's simple: we just find the first rack in the list that has at least two nodes in it. That should ensure that we perform an operation that doesn't lead to any unforeseen disaster. ------------------------------------------------------------------------ Since the test effectively becomes more complex due to more care for keeping the topology of the cluster valid, we extend the log messages to make them more helpful when debugging a failure.	2025-10-28 14:15:57 +01:00
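The rack selection described in the commit above ("find the first rack in the list that has at least two nodes in it") can be sketched as follows. This is only an illustration: the input shape (a list of `(server_id, rack)` pairs in `manager.running_servers` order) and the return convention are assumptions, not the signature of the real `select_viable_rack` helper.

```python
from collections import defaultdict
from typing import Optional


def select_viable_rack(servers: list[tuple[str, str]]) -> Optional[str]:
    """Return the first rack that holds at least two nodes, or None.

    `servers` is a hypothetical list of (server_id, rack) pairs, assumed to
    be in the order the servers were added to the cluster.
    """
    nodes_per_rack: dict[str, list[str]] = defaultdict(list)
    for server_id, rack in servers:
        nodes_per_rack[rack].append(server_id)
    # Dicts iterate in insertion order, so racks are visited in the order
    # in which their first node appears in `servers`.
    for rack, nodes in nodes_per_rack.items():
        if len(nodes) >= 2:
            return rack
    return None
```

With one node per rack the function returns `None`, which corresponds to the "log and do nothing" fallback the commit message mentions.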
Nadav Har'El	87573197d4	test/alternator: reproducers for missing headers and request limit This patch adds reproducing tests in test/alternator for issue #23438, which is about missing checks for the length of headers and the URL in Alternator requests. These should be limited, because Seastar's HTTP server, which Scylla uses, reads them into memory so they can OOM Scylla. The tests demonstrate that DynamoDB enforces a 16 KB limit on the headers and the URL of the request, but Scylla doesn't (a code inspection suggests it does not in fact have any limit). The two tests pass on DynamoDB and currently xfail on Alternator. Refs #23438. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23442	2025-10-28 15:12:25 +02:00
Pavel Emelyanov	d9bfbeda9a	lister: Fix race between readdir and stat Sometimes file::list_directory() returns entries without type set. In that case lister calls file_type() on the entry name to get it. In case the call returns disengaged type, the code assumes that some error occurred and resolves into exception. That's not correct. The file_type() method returns disengaged type only if the file being inspected is missing (i.e. on ENOENT errno). But this can validly happen if a file is removed between readdir and stat. In that case it's not "some error happened", but an entry that should just be skipped. If some error had happened, file_type() would resolve into an exceptional future on its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26595	2025-10-28 15:10:22 +02:00
Botond Dénes	ac618a53f4	Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk Currently, batchlog replay is considered successful even if all batches fail to be sent (they are replayed later). However, repair requires all batches to be sent successfully. Currently, if batchlog isn't cleared, the repair never learns and updates the repair_time. If GC mode is set to "repair", this means that the tombstones written before the repair_time (minus propagation_delay) can be GC'd while not all batches were replayed. Consider a scenario: - Table t has a row with (pk=1, v=0); - There is an entry in the batchlog that sets (pk=1, v=1) in table t; - The row with pk=1 is deleted from table t; - Table t is repaired: - batchlog replay fails; - repair_time is updated; - propagation_delay seconds pass and the tombstone of pk=1 is GC'd; - batchlog is replayed and (pk=1, v=1) inserted - data resurrection! Do not update repair_time if sending any batch fails. The data is still repaired. For tablet repair the repair runs, but at the end the exception is passed to topology coordinator. Thanks to that the repair_time isn't updated. The repair request isn't removed as well, due to which the repair will need to rerun. Apart from that, a batch is removed from the batchlog if its version is invalid or unknown. The condition on which we consider a batch too fresh to replay is updated to consider propagation_delay. 
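The readdir/stat race fixed above is not specific to Seastar. A minimal Python analogue, assuming nothing about the actual lister code, treats a missing file at stat time as "skip this entry" and lets every other error propagate:

```python
import os
import stat


def list_entries(path: str) -> list[tuple[str, bool]]:
    """Return (name, is_dir) pairs, skipping entries removed mid-listing."""
    result = []
    for name in os.listdir(path):  # the "readdir" step
        try:
            st = os.lstat(os.path.join(path, name))  # the "stat" step
        except FileNotFoundError:
            # The entry was removed between readdir and stat (ENOENT).
            # That is not an error, so the entry is simply skipped.
            continue
        # Any other OSError (e.g. EACCES) is a real error and propagates.
        result.append((name, stat.S_ISDIR(st.st_mode)))
    return result
```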
Fixes: https://github.com/scylladb/scylladb/issues/24415 Data resurrection fix; needs backport to all versions Closes scylladb/scylladb#26319 * github.com:scylladb/scylladb: db: fix indentation test: add reproducer for data resurrection repair: fail tablet repair if any batch wasn't sent successfully db/batchlog_manager: fix making decision to skip batch replay db: repair: throw if replay fails db/batchlog_manager: delete batch with incorrect or unknown version db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-28 14:52:59 +02:00
Botond Dénes	f3cec5f11a	Merge 'index: Set tombstone_gc when creating underlying view' from Dawid Mędrek Before this commit, when the underlying materialized view was created, it didn't have the property `tombstone_gc` set to any value. We fix the bug in this PR. Implementation strategy: 1. Move code responsible for producing the schema of a secondary index to the file that handles `CREATE INDEX`. 2. Set the property when creating the view. 3. Add reproducer tests. Fixes scylladb/scylladb#26542 Backport: we can discuss it. Closes scylladb/scylladb#26543 * github.com:scylladb/scylladb: index: Set tombstone_gc when creating secondary index index: Make `create_view_for_index` method of `create_index_statement` index: Move code for creating MV of secondary index to cql3 db, cql3: Move creation of underlying MV for index	2025-10-28 14:42:42 +02:00
Nadav Har'El	c3593462a4	alternator: improve protection against oversized requests Following DynamoDB, Alternator also places a 16 MB limit on the size of a request. Such a limit is necessary to avoid running out of memory - because the AWS message authentication protocol requires reading the entire request into memory before its signature can be verified. Our implementation for this limit used Seastar's HTTP server's content_length_limit feature. However, this Seastar feature is incomplete - it only works when the request uses the Content-Length header, and doesn't do anything if the request doesn't have a Content-Length (it may use chunked encoding, or have no length at all). So malicious users can cause Scylla to OOM by sending a huge request without a Content-Length. So in this patch we stop using the incomplete Seastar feature, and implement the length limit in Scylla in a way that works correctly with or without Content-Length: We read from the input stream and if we go over 16MB, we generate an error. Because we dropped Seastar's protection against a long Content-Length, we also need to fix a piece of code which used Content-Length to reserve some semaphore units to prevent reading many large requests in parallel. We fix two problems in the code: 1. If Content-Length is over the limit, we shouldn't attempt to reserve semaphore units - this should just be a Payload Too Large error. 2. If Content-Length is missing, the existing code did nothing and had a TODO that we should. In this patch we implement what was suggested in that TODO: We temporarily reserve the whole 16 MB limit, and after reading the actual request, we return part of the reservation according to the real request size. That last fix is important, because typically the largest requests will be BatchWriteItem where a well-written client would want to use chunked encoding, not Content-Length, to avoid materializing the entire request up-front. 
For such clients, the memory use semaphore did nothing, and now it does the right thing. Note that this patch does not solve the problem #12166 that existed with Seastar's length-limiting implementation but still exists in the new in-Scylla length-limiting implementation: The fact we send an error response in the middle of the request and then close the connection, while the client continues to send the request, can lead to an RST being sent by the server kernel. Usually this will be fine - well-written client libraries will be able to read the response before the RST. But even with a well-written library in some rare timings the client may get the RST before the response, and will miss the response, and get an empty or partial response or "connection reset by peer". This issue existed before this patch, and still exists, but is probably of minor impact. Fixes #8196 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23434	2025-10-28 15:24:46 +03:00
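The two mechanisms described in the commit above (a length limit that works with or without Content-Length, and a reservation that starts at the full limit and is trimmed to the real size) can be sketched as follows. `MemoryBudget`, `PayloadTooLarge` and `read_request` are hypothetical names for illustration; the actual Alternator code uses Seastar streams and semaphores and is asynchronous.

```python
from typing import Iterable, Optional

REQUEST_LIMIT = 16 * 1024 * 1024  # the 16 MB limit from the commit message


class PayloadTooLarge(Exception):
    pass


class MemoryBudget:
    """Toy stand-in for a byte-counting semaphore (hypothetical)."""

    def __init__(self, units: int) -> None:
        self.available = units

    def reserve(self, n: int) -> None:
        if n > self.available:
            raise RuntimeError("a real semaphore would wait here")
        self.available -= n

    def release(self, n: int) -> None:
        self.available += n


def read_request(chunks: Iterable[bytes], content_length: Optional[int],
                 budget: MemoryBudget,
                 limit: int = REQUEST_LIMIT) -> tuple[bytes, int]:
    """Read a request body; return it with the number of units still held."""
    if content_length is not None and content_length > limit:
        raise PayloadTooLarge()  # reject before reserving anything
    # Without Content-Length, reserve the whole limit up front.
    reserved = content_length if content_length is not None else limit
    budget.reserve(reserved)
    body = bytearray()
    try:
        for chunk in chunks:
            body += chunk
            if len(body) > limit:  # enforced while reading the stream
                raise PayloadTooLarge()
    except BaseException:
        budget.release(reserved)
        raise
    if len(body) < reserved:
        # Return the part of the reservation the request did not need.
        budget.release(reserved - len(body))
        reserved = len(body)
    return bytes(body), reserved
```

The caller is expected to release the returned number of units once it has finished processing the request.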
Lakshmi Narayanan Sreethar	64c1ec99e0	cmake: link crypto lib to utils The utils library requires OpenSSL's libcrypto for cryptographic operations and without linking libcrypto, builds fail with undefined symbol errors. Fix that by linking `crypto` to `utils` library when compiled with cmake. The build files generated with configure.py already have `crypto` lib linked, so they do not have this issue. Fix #26705 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26707	2025-10-28 14:11:03 +02:00
Ferenc Szili	10f07fb95a	load_balancer: load_stats reconcile after tablet migration and table resize This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to issue migrations which improve load balance.	2025-10-28 12:12:09 +01:00
Anna Stuchlik	6fa342fb18	doc: add OS support for version 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26450 Closes scylladb/scylladb#26616	2025-10-28 13:29:40 +03:00
Radosław Cybulski	ea6b22f461	Add max trace size output configuration variable In #24031 users complained that the trace message is truncated, namely it's no longer json parsable and table name might not be part of the output. This patch enables users to configure maximum size of trace message. In case user wanted `table` name, but didn't care about message size, #26634 will help. - add configuration variable `alternator_max_users_query_size_in_trace_output` with default value of 4096 (4 times old default value). - modify `truncated_content_view` function to use new configuration variable for truncation limit - update `truncated_content_view` to consistently truncate at given size, previously truncation would also happen when data arrived in more than one chunk - update `truncated_content_view` to better handle truncated value (limit number of copies) - fix `scylla_config_read` call - call to `query` for a configuration name that is not existing will return `Items` array empty (but present) - this would raise array access exception few lines below. - add test Refs #26634 Refs #24031 Closes scylladb/scylladb#26618	2025-10-28 13:29:15 +03:00
Pavel Emelyanov	ac1d709709	Merge 'tablet_sstable_streamer: defer SSTable unlinking until fully streamed' from Taras Veretilnyk When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files. test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed. Fixes #26606 Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444. Closes scylladb/scylladb#26622 * github.com:scylladb/scylladb: test_tablets2: verify SSTable cleanup after tablet load and stream tablet_sstable_streamer: replace unlink() call with mark_for_deletion()	2025-10-28 13:25:40 +03:00
Patryk Jędrzejczak	5321720853	test: test_raft_recovery_stuck: reconnect driver after rolling restarts It turns out that #21477 wasn't sufficient to fix the issue. The driver may still decide to reconnect the connection after `rolling_restart` returns. One possible explanation is that the driver sometimes handles the DOWN notification after all nodes consider each other UP. Reconnecting the driver after restarting nodes seems to be a reliable workaround that many tests use. We also use it here. Fixes #19959 Closes scylladb/scylladb#26638	2025-10-28 13:24:23 +03:00
Anna Stuchlik	bd5b966208	doc: add --list-active-releases to Web Installer Fixes https://github.com/scylladb/scylladb/issues/26688 V2 of https://github.com/scylladb/scylladb/pull/26687 Closes scylladb/scylladb#26689	2025-10-28 13:21:57 +03:00
Pavel Emelyanov	54a117b19d	Merge 'retry_strategy: Switch to using seastar's retry_strategy (take two)' from Ernest Zaslavsky With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB. Although this update is purely a refactoring effort and does not introduce functional changes, it should be backported to 2025.3 and 2025.4; otherwise future backports of bugfixes/improvements related to `s3_client` will be nearly impossible. ref: https://github.com/scylladb/seastar/issues/2803 depends on: https://github.com/scylladb/seastar/pull/2960 Closes scylladb/scylladb#25801 * github.com:scylladb/scylladb: s3_client: remove unnecessary `co_await` in `make_request` s3 cleanup: remove obsolete retry-related classes s3_client: remove unused `filler_exception` s3_client: fix indentation s3_client: simplify chunked download error handling using `make_request` s3_client: reformat `make_request` functions for readability s3_client: eliminate duplication in `make_request` by using overload s3_client: reformat `make_request` function declarations for readability s3_client: reorder `make_request` and helper declarations s3_client: add `make_request` override with custom retry and error handler s3_client: migrate s3_client to Seastar HTTP client s3_client: fix crash in `copy_s3_object` due to dangling stream s3_client: coroutinize `copy_s3_object` response callback aws_error: handle missing `unexpected_status_error` case s3_creds: use Seastar HTTP client with retry strategy retry_strategy: add exponential backoff to `default_aws_retry_strategy` retry_strategy: introduce Seastar-based retry strategy retry_strategy: update CMake and configure.py for new strategy retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy` retry_strategy: fix include 
retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc	2025-10-28 13:08:42 +03:00
Patryk Jędrzejczak	820c8e7bc4	Merge 'LWT: use shards_ready_for_reads for replica locks' from Petr Gusev When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard. The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395). Fixes scylladb/scylladb#26727 backport: need to backport to 2025.4 Closes scylladb/scylladb#26719 * https://github.com/scylladb/scylladb: paxos_state: use shards_ready_for_reads paxos_state: inline shards_for_writes into get_replica_lock	2025-10-28 10:37:53 +01:00
Avi Kivity	d81796cae3	Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and release it when it finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. Fixes https://github.com/scylladb/scylladb/issues/25341 Closes scylladb/scylladb#25456 * github.com:scylladb/scylladb: mv: limit concurrent view updates from all sources database: rename _view_update_concurrency_sem to _view_update_memory_sem	2025-10-28 11:13:24 +02:00
Aleksandra Martyniuk	910cd0918b	locator: use get_primary_replica for get_primary_endpoints Currently, tablet_sstable_streamer::get_primary_endpoints is out of sync with tablet_map::get_primary_replica. The get_primary_replica optimizes the choice of the replica so that the work is fairly distributed among nodes. Meanwhile, get_primary_endpoints always chooses the first replica. Use get_primary_replica for get_primary_endpoints. Fixes: https://github.com/scylladb/scylladb/issues/21883. Closes scylladb/scylladb#26385	2025-10-28 09:56:08 +02:00
Michael Litvak	8743422241	cdc: improve cdc metadata loading when loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. the CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. on tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and append them to the existing map. we can do this because we only append new entries to cdc_streams_history with higher timestamp than all previous entries. This makes this reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablets splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes scylladb/scylladb#26732	2025-10-28 08:54:09 +01:00
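The incremental reload described above relies on `cdc_streams_history` being append-only with increasing timestamps, so only entries newer than the newest loaded timestamp need to be read. A minimal sketch under that assumption follows; `StreamsHistory` and `reload_streams` are hypothetical stand-ins, and the real table layout and in-memory map are richer than this.

```python
import bisect


class StreamsHistory:
    """Toy append-only history table, sorted by timestamp (hypothetical)."""

    def __init__(self) -> None:
        self._timestamps: list[int] = []
        self._stream_sets: list[list[str]] = []

    def append(self, ts: int, streams: list[str]) -> None:
        # New entries always carry a higher timestamp than existing ones.
        assert not self._timestamps or ts > self._timestamps[-1]
        self._timestamps.append(ts)
        self._stream_sets.append(list(streams))

    def read_after(self, ts):
        """Range read: only the entries with timestamp > ts (all if None)."""
        start = 0 if ts is None else bisect.bisect_right(self._timestamps, ts)
        return list(zip(self._timestamps[start:], self._stream_sets[start:]))


def reload_streams(streams_map: dict, history: StreamsHistory) -> int:
    """Append new history entries to the map; return how many were read."""
    last_loaded = max(streams_map) if streams_map else None
    new_entries = history.read_after(last_loaded)
    for ts, streams in new_entries:
        streams_map[ts] = streams
    return len(new_entries)
```

After the initial load, each update reads only the entries appended since the previous reload, instead of the whole history.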
Wojciech Mitros	f07a86d16e	mv: limit concurrent view updates from all sources Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and release it when it finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. The effect of this patch can also be observed when writing to a base table with a large number of materialized views, like in the materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent dtest. In that test, if we perform a full scan in parallel to a write workload with a concurrency of 100 to a table with 100 views, the scan would sometimes time out because it would effectively get 1/10000 of cpu. With this patch, the cpu concurrency of view updates was limited to 128 (we ran both writes and scan in the same service level), and the scan no longer timed out. Fixes https://github.com/scylladb/scylladb/issues/25341	2025-10-27 18:55:41 +01:00
Pavel Emelyanov	81f598225e	error_injection: Add template parameter default in release mode The std::optional<T> inject_parameter(...) method is a template, and in dev/debug modes this parameter is defaulted to std::string_view, but for release mode it's not. This patch makes it symmetrical. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26706	2025-10-27 16:39:22 +01:00
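The semaphore-based limit described above can be illustrated with a small asyncio sketch. It is not the Scylla implementation (which is C++ on Seastar, with one semaphore per shard and service level); `run_view_updates` and its parameters are hypothetical and only show one unit being taken per local view update and released when the update finishes.

```python
import asyncio


async def run_view_updates(n_updates: int, limit: int) -> int:
    """Run fake local view updates, at most `limit` at a time.

    Returns the peak number of updates that were in flight together.
    """
    sem = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def local_view_update() -> None:
        nonlocal running, peak
        async with sem:  # take one unit for this local view update
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # stand-in for the actual write
            running -= 1
        # Leaving the `async with` block releases the unit.

    await asyncio.gather(*(local_view_update() for _ in range(n_updates)))
    return peak
```

Calling `asyncio.run(run_view_updates(50, 8))` never observes more than 8 updates in flight, however many are submitted.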
Taras Veretilnyk	1361ae7a0a	test_tablets2: verify SSTable cleanup after tablet load and stream Modify existing test_tablet_load_and_stream testcase to verify that SSTable files are properly deleted from the upload directory after streaming.	2025-10-27 16:36:08 +01:00
Taras Veretilnyk	517a4dc4df	tablet_sstable_streamer: replace unlink() call with mark_for_deletion() When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces the unlink() call with mark_for_deletion().	2025-10-27 16:30:05 +01:00
Petr Gusev	478f7f545a	paxos_state: use shards_ready_for_reads Acquiring locks on both shards for the entire tablet migration period is redundant. In most cases, locking only the old shard or only the new shard is sufficient. Using shards_ready_for_reads reduces the situations in which we need to lock both shards to: * intra-node migrations only * only during the write_both_read_new state Once the global barrier completes in the write_both_read_new state, no requests remain on the old shard, so we can safely acquire locks only on the new shard. Fixes scylladb/scylladb#26727	2025-10-27 16:22:28 +01:00
Piotr Dulikowski	fd966ec10d	Merge 'cdc: garbage collect CDC streams for tablets' from Michael Litvak introduce helper functions that can be used for garbage collecting old cdc streams for tablets-based keyspaces. add a background fiber to the topology coordinator that runs periodically and checks for old CDC streams for tablets keyspaces that can be garbage collected. the garbage collection works by finding the newest cdc timestamp that has been closed for more than the configured cdc TTL, and removing all information from the cdc internal tables about cdc timestamps and streams up to this timestamp. in general it should be safe to remove information about these streams because they are closed for more than TTL, therefore all rows that were written to these streams with the configured TTL should be dead. the exception is if the TTL is altered to a smaller value, and then we may remove information about streams that still have live rows that were written with the longer ttl. Fixes https://github.com/scylladb/scylladb/issues/26669 Closes scylladb/scylladb#26410 * github.com:scylladb/scylladb: cdc: garbage collect CDC streams periodically cdc: helpers for garbage collecting old streams for tablets	2025-10-27 16:16:55 +01:00
Michał Hudobski	541b52cdbf	cql: fail with a better error when null vector is passed to ann query Currently when a null vector is passed to an ANN query we fail with a quite confusing error ("NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="to_bytes() called on raw value that is null">})"). This patch fixes that by throwing an InvalidRequestException with an appropriate message instead. We also add a test case that validates this behavior. Fixes: VECTOR-257 Closes scylladb/scylladb#26510	2025-10-27 16:09:08 +02:00
Botond Dénes	417270b726	Merge 'Port dtest EAR tests to test.py/pytest in scylla CI' from Calle Wilund Fixes #26641 * Adds shared abstraction for dockerized mock services for our pytests (not using python docker, due to both library and podman) * Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing * Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest. * Shared KMIP mock between boost test and pytest and speeds up boost test shutdown. When merged, the dtest counterpart can be decommissioned. Closes scylladb/scylladb#26642 * github.com:scylladb/scylladb: test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper test::cluster::test_encryption: Port dtest EAR tests test::cluster::conftest: Add key_provider fixture test::pylib::encryption_provider: Port dtest encryption provider classes test::pylib::dockerized_service: Add helper for running docker/podman test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures test::boost::kmip_wrapper: Move python script for PyKMIP to pylib	2025-10-27 15:42:52 +02:00
Patryk Jędrzejczak	e1c3f666c9	Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev Problems addressed by this PR * Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up. * Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations. * `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details). * Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`: * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited. * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between. * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them. * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary. 
Fixes scylladb/scylladb#26150 backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting Closes scylladb/scylladb#26315 * https://github.com/scylladb/scylladb: test_automatic_cleanup: add test_cleanup_waits_for_stale_writes test_fencing: fix due to new version increment test_automatic_cleanup: clean it up storage_proxy: wait for closing sessions in sstable cleanup fiber storage_proxy: rename await_pending_writes -> await_stale_pending_writes storage_proxy: use run_fenceable_write storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check storage_proxy: introduce run_fenceable_write storage_proxy: move update_fence_version from shared_token_metadata storage_proxy: fix start_write() operation scope in apply_locally storage_proxy: move post fence check into handle_write storage_proxy: move fencing into mutate_counter_on_leader_and_replicate storage_proxy::handle_read: add fence check before get_schema storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber sstable_cleanup_fiber: use coroutine::parallel_for_each storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock topology_coordinator: barrier before cleanup topology_coordinator: small start_cleanup refactoring global_token_metadata_barrier: add fenced flag	2025-10-27 12:35:13 +01:00
Petr Gusev	5ab2db9613	paxos_state: inline shards_for_writes into get_replica_lock No need to have two functions since both callers of get_replica_lock() use shards_for_writes() to compute the shards where the locks must be acquired. Also while at it, inline the acquire() lambda in get_replica_lock() and replace it with a loop over shards. This makes the code more straightforward.	2025-10-27 11:12:29 +01:00
Michael Litvak	6109cb66be	cdc: garbage collect CDC streams periodically add a background fiber to the topology coordinator that runs periodically and checks for old CDC streams for tablets keyspaces that can be garbage collected.	2025-10-26 11:01:20 +01:00
Michael Litvak	440caeabcb	cdc: helpers for garbage collecting old streams for tablets introduce helper functions that can be used for garbage collecting old cdc streams for tablets-based keyspaces. - get_new_base_for_gc: finds a new base timestamp given a TTL, such that all older timestamps and streams can be removed. - get_cdc_stream_gc_mutations: given new base timestamp and streams, builds mutations that update the internal cdc tables and remove the older streams. - garbage_collect_cdc_streams_for_table: combines the two functions above to find a new base and build mutations to update it for a specific table - garbage_collect_cdc_streams: builds gc mutations for all cdc tables	2025-10-26 11:01:20 +01:00
Avi Kivity	b843d8bc8b	Merge 'scylla-sstable: add cql support to write operation' from Botond Dénes In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where the content of certain system tables needs to be manually re-created, system tables that are not writable directly via CQL. In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because all of these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters. This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable. To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size; consequently, the command can generate multiple output sstables. The new CQL input-format is made default; this is safe as nobody is using this command anyway. Hopefully this PR will change that. Fixes: https://github.com/scylladb/scylladb/issues/26506 New feature, no backport.
Closes scylladb/scylladb#26515 * github.com:scylladb/scylladb: test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql replica/mutation_dump: add support for virtual tables tools/scylla-sstable: print_query_results_json(): handle empty value buffer tools/scylla-sstable: add cql support to write operation tools/scylla-sstable: write_operation(): fix indentation tools/scylla-sstable: write_operation(): prepare for a new input-format tools/scylla-sstable: generalize query_operation_validate_query() tools/scylla-sstable: move query_operation_validate_query() tools/scylla-sstable: extract schema transformation from query operation replica/table: add virtual write hook to the other apply() overload too	2025-10-24 23:32:40 +03:00
Avi Kivity	997b52440e	Merge 'replica/mutation_dump: include empty/dead partitions in the scan results' from Botond Dénes `select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so all data is visible in the result, even that which is fully garbage collectible. Fixes scylladb/scylladb#23707. Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand. Closes scylladb/scylladb#26227 * github.com:scylladb/scylladb: replica/mutation_dump: multi_range_partition_generator: disable garbage-collection replica: add tombstone_gc_enabled parameter to mutation query methods mutation/mutation_compactor: remove _can_gc member tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc	2025-10-24 23:26:16 +03:00
Patryk Jędrzejczak	5ae1aba107	test: unskip test_raft_recovery_entry_loss The issue has been fixed in #26612. Closes scylladb/scylladb#26614	2025-10-24 21:23:41 +03:00
Ferenc Szili	b4ca12b39a	load_stats: change data structure which contains tablet sizes This patch changes the tablet size map in load_stats. Previously, this data structure was: std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes; and is changed into: std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes; This allows for improved performance of tablet size reconciliation.	2025-10-24 14:37:00 +02:00
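The layout change described in the commit above can be illustrated with a small Python sketch. It is only an analogy for the C++ maps named in the message: it assumes the old key combined a table id with a token range, and the names `flat_sizes`, `to_nested`, and `table_total` are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-ins: a table id is a string and a token range is a
# (start, end) tuple. The flat map mimics one entry per (table, range) key.
flat_sizes = {
    ("t1", (0, 100)): 10,
    ("t1", (100, 200)): 20,
    ("t2", (0, 200)): 5,
}


def to_nested(flat):
    """Regroup {(table, range): size} into {table: {range: size}}."""
    nested = defaultdict(dict)
    for (table, token_range), size in flat.items():
        nested[table][token_range] = size
    return dict(nested)


def table_total(nested, table):
    """Sum tablet sizes for one table without scanning other tables' entries."""
    return sum(nested.get(table, {}).values())


nested_sizes = to_nested(flat_sizes)
```

With the nested layout, per-table work touches only that table's inner map, which is one plausible reason such a grouping helps reconciliation.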
Andrzej Jackowski	8642629e8e	test: add test_anonymous_user to test_raft_service_levels The primary goal of this test is to reproduce scylladb/scylladb#26040 so the fix (`278019c328`) can be backported to older branches. Scenario: connect via CQL as an anonymous user and verify that the `sl:default` scheduling group is used. Before the fix for #26040 `main` scheduling group was incorrectly used instead of `sl:default`. Control connections may legitimately use `sl:driver`, so the test accepts those occurrences while still asserting that regular anonymous queries use `sl:default`. This adds explicit coverage on master. After scylladb#24411 was implemented, some other tests started to fail when scylladb#26040 was unfixed. However, none of the tests asserted this exact behavior. Refs: scylladb/scylladb#26040 Refs: scylladb/scylladb#26581 Closes scylladb/scylladb#26589	2025-10-24 12:23:34 +02:00
Ernest Zaslavsky	e8ce49dadf	s3_client: remove unnecessary `co_await` in `make_request` Eliminates a redundant `co_await` by directly returning the `future`, simplifying the control flow without affecting behavior.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	71ea973ae4	s3 cleanup: remove obsolete retry-related classes Delete `default_retry_strategy` and `retryable_http_client`, no longer used in `s3_client` after recent refactors.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	d44bbb1b10	s3_client: remove unused `filler_exception` Eliminate the now-obsolete `filler_exception`, rendered redundant by earlier refactors that streamlined error handling in the S3 client.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	d3c6338de6	s3_client: fix indentation Fix indentation in background download fiber in `chunked_download_source`	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	47704deb1e	s3_client: simplify chunked download error handling using `make_request` Refactor `chunked_download_source` to eliminate redundant exception handling by leveraging the new `make_request` override with custom retry strategy. This streamlines the download fiber logic, improving readability and maintainability.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	2bc9b205b6	s3_client: reformat `make_request` functions for readability Reformats `make_request` functions with long argument lists to improve readability and comply with formatting guidelines.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	bf39412f4a	s3_client: eliminate duplication in `make_request` by using overload Removes redundant code in the `make_request` function by invoking the appropriate overload, simplifying logic and improving maintainability.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	695e70834e	s3_client: reformat `make_request` function declarations for readability Reformats the `make_request` function declarations to improve readability due to the large number of arguments. This aligns with our formatting guidelines and makes the code easier to maintain.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	9f01c1f3ff	s3_client: reorder `make_request` and helper declarations Performs minor reordering of helper functor declarations in the header file to improve readability and maintain logical grouping.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	3d51124cb0	s3_client: add `make_request` override with custom retry and error handler Introduce an override for `make_request` in `s3_client` to support custom retry strategies and error handlers, enabling flexibility beyond the default client behavior and improving control over request handling	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	bdb3979456	s3_client: migrate s3_client to Seastar HTTP client Eliminate use of `retryable_http_client` in `s3_client` and adopt Seastar's native HTTP client.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	2025760e75	s3_client: fix crash in `copy_s3_object` due to dangling stream In the `copy_part` method, move the `input_stream<char>` argument into a local variable before use. Failing to do so can lead to a SIGSEGV or trigger an abort under address sanitizer.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	0983c791e9	s3_client: coroutinize `copy_s3_object` response callback coroutinize `copy_s3_object` response callback for a bugfix in the following commit to prevent failing on dangling stream	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	237217c798	aws_error: handle missing `unexpected_status_error` case Add a missing `case` clause to the `switch` statement to correctly handle scenarios where `unexpected_status_error` is thrown. This fixes overlooked error handling and improves robustness.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	4f6384b1a0	s3_creds: use Seastar HTTP client with retry strategy In AWS credentials providers, replace `retryable_http_client` with Seastar's native HTTP client. Integrate the newly added `default_aws_retry_strategy` to handle retries more efficiently and reduce dependency on external retry logic.	2025-10-23 15:58:07 +03:00
Ernest Zaslavsky	3851ee58d7	retry_strategy: add exponential backoff to `default_aws_retry_strategy` Add exponential backoff to `default_aws_retry_strategy` and `sleep` for the backoff delay before returning `true`; this is a no-op in case of a non-retryable error	2025-10-23 15:49:34 +03:00
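As a rough illustration of the pattern this commit describes (sleep with exponential backoff before reporting that a retry should happen, and do nothing for non-retryable errors), here is a minimal Python sketch. It is not the Scylla or Seastar implementation; the function names, the base delay, the cap, and the retry limit are all assumptions chosen for the example.

```python
def backoff_delay_ms(attempt, base_ms=100, cap_ms=10_000):
    """Delay doubles with each attempt and is clamped to cap_ms (assumed values)."""
    return min(cap_ms, base_ms * (2 ** attempt))


def should_retry(attempt, retryable, max_retries=3, sleep=lambda ms: None):
    """Return True if the caller should retry, sleeping for the backoff first.

    `sleep` is injected so the sketch can be exercised without real waiting.
    Non-retryable errors and exhausted attempts return False without sleeping.
    """
    if not retryable or attempt >= max_retries:
        return False
    sleep(backoff_delay_ms(attempt))
    return True
```

Injecting `sleep` keeps the decision logic testable; a real client would pass an actual (asynchronous) sleep.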
Ernest Zaslavsky	524737a579	retry_strategy: introduce Seastar-based retry strategy Add a new class derived from Seastar's `default_retry_strategy`. Relocate the `should_retry` implementation from Scylla's `default_retry_strategy` into the new class to centralize and standardize retry behavior.	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	51aadd0ab3	retry_strategy: update CMake and configure.py for new strategy Include `default_aws_retry_strategy` in the build system by updating CMake and `configure.py` to ensure it is properly compiled and linked.	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	5d65b47a15	retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy` Renames the `default_retry_strategy` class to `default_aws_retry_strategy` to clarify its association with the S3 client implementation. This avoids confusion with the unrelated `seastar::default_retry_strategy` class.	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	cc200ced67	retry_strategy: fix include Fix header inclusion in "newly" created file	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	d679fd514c	retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	7cd4be4c49	retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc	2025-10-23 15:49:34 +03:00
Aleksandra Martyniuk	6fc43f27d0	db: fix indentation	2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk	1935268a87	test: add reproducer for data resurrection Add a reproducer to check that the repair_time isn't updated if the batchlog replay fails. If repair_time was updated, tombstones could be GC'd before the batchlog is replayed. The replay could later cause the data resurrection.	2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk	d436233209	repair: fail tablet repair if any batch wasn't sent successfully If any batch replay failed, we cannot update repair_time as we risk the data resurrection. If replay of any batch needs to be retried, run the whole repair but fail at the very end, so that the repair_time for it won't be updated.	2025-10-23 10:39:42 +02:00
Aleksandra Martyniuk	e1b2180092	db/batchlog_manager: fix making decision to skip batch replay Currently, we skip batch replay if less than batch_log_timeout passed from the moment the batch was written. batch_log_timeout value can be configured. If it is large, the batch won't be replayed for a long time. If the tombstone is GC'd before the batch is replayed, then we risk data resurrection. To ensure safety we can skip only the batches that won't be GC'd. In this patch we skip replay of the batches for which: now() < written_at + min(timeout, propagation_delay) repair_time is set as the start of batchlog replay, so at the moment of the check we will have: repair_time <= now() So we know that: repair_time < written_at + propagation_delay With this condition we are sure that GC won't happen.	2025-10-23 10:38:31 +02:00
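The skip condition in this commit message can be written as a one-line predicate. The Python sketch below assumes the `min(...)` in the message is taken over both `timeout` and `propagation_delay` (which is what the derived bound `repair_time < written_at + propagation_delay` requires); the function name and the use of plain numbers for timestamps are hypothetical.

```python
def can_skip_replay(now, written_at, timeout, propagation_delay):
    """A batch may be skipped only while it is too fresh to be affected by GC.

    Because min(timeout, propagation_delay) <= propagation_delay, a True result
    together with repair_time <= now implies
    repair_time < written_at + propagation_delay.
    """
    return now < written_at + min(timeout, propagation_delay)
```

Note that a large configured `timeout` no longer extends the skip window, since the bound is capped by `propagation_delay`.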
Aleksandra Martyniuk	7f20b66eff	db: repair: throw if replay fails Return a flag determining whether all the batches were sent successfully in batchlog_manager::replay_all_failed_batches (batches skipped due to being too fresh are not counted). Throw in repair_flush_hints_batchlog_handler if not all batches were replayed, to ensure that repair_time isn't updated.	2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk	904183734f	db/batchlog_manager: delete batch with incorrect or unknown version batchlog_manager::replay_all_failed_batches skips batches that have unknown or incorrect version. Next round will process these batches again. Such batches will probably be skipped every time, so there is no point in keeping them. Even if at some point the version becomes correct, we should not replay the batch - it might be old and this may lead to data resurrection.	2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk	502b03dbc6	db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-23 10:38:31 +02:00
Ernest Zaslavsky	abd3abc044	cmake: fix the seastar API level Fix the build to make it compile when using CMake by defining the right Seastar API level Closes scylladb/scylladb#26690	2025-10-23 11:20:20 +03:00
Botond Dénes	f8b0142983	Merge 'Add --drop-unfixable-sstables flag for scrub in segregate mode' from Taras Veretilnyk This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and set of tests to ensure correct behavior and error handling. Fixes #19060 Backport is not required, it is a new feature Closes scylladb/scylladb#26579 * github.com:scylladb/scylladb: sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option test/nodetool: add scrub drop-unfixable-sstables option testcase scrub: add support for dropping unfixable sstables in segregate mode	2025-10-23 11:06:19 +03:00
Wojciech Mitros	c0d0f8f85b	database: rename _view_update_concurrency_sem to _view_update_memory_sem In the following commit, we'll introduce a new semaphore for view updates that limits their concurrency by view update count. To avoid confusion, we rename the existing semaphore that tracks the memory used by concurrent view updates and related objects accordingly.	2025-10-23 10:00:15 +02:00
Tomasz Grabiec	564cebd0e6	Merge 'tablet_metadata_guard: fix split/merge handling' from Petr Gusev The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the `tablet_id` field (`_tablet`), which means the guard can no longer correctly protect ongoing operations from tablet migrations. The problem is specific to LWT, since `tablet_metadata_guard` is used mostly for heavy topology operations, which exclude with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the `token_metadata_guard` wrapper). Fixes [scylladb/scylladb#26437](https://github.com/scylladb/scylladb/issues/26437) backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets. Closes scylladb/scylladb#26619 * github.com:scylladb/scylladb: test_tablets_lwt: add test_tablets_merge_waits_for_lwt test.py: add universalasync_typed_wrap tablet_metadata_guard: fix split/merge handling tablet_metadata_guard: add debug logs paxos_state: shards_for_writes: improve the error message storage_service: barrier_and_drain – change log level to info topology_coordinator: fix log message	2025-10-22 20:56:21 +02:00
Taras Veretilnyk	60334c6481	sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option Added a new test case, sstable_scrub_segregate_mode_drop_unfixable_sstables_test, which verifies that when the drop-unfixable-sstables flag is enabled in segregate mode, corrupted SSTables are correctly dropped.	2025-10-22 17:16:55 +02:00
Taras Veretilnyk	11874755a3	test/nodetool: add scrub drop-unfixable-sstables option testcase This patch introduces the test_scrub_drop_unfixable_sstables_option testcase, which verifies that the correct request is generated when the --drop-unfixable-sstables flag is used. It also validates that an error is thrown if the drop-unfixable-sstables flag is enabled and mode is not set to SEGREGATE.	2025-10-22 17:16:55 +02:00
Taras Veretilnyk	42da7f1eb6	scrub: add support for dropping unfixable sstables in segregate mode This patch adds a new flag `drop-unfixable-sstables` to the scrub operation in segregate mode, allowing scrub to automatically drop SSTables that cannot be fixed. It also includes API support for the 'drop_unfixable_sstables' parameter and validation to ensure this flag is not enabled in modes other than segregate.	2025-10-22 17:16:49 +02:00
Radosław Cybulski	621e88ce52	Fix spelling errors Closes scylladb/scylladb#26652	2025-10-22 16:46:31 +02:00
Petr Gusev	22271b9fe7	test_automatic_cleanup: add test_cleanup_waits_for_stale_writes	2025-10-22 16:31:43 +02:00
Petr Gusev	d1fc111dd7	test_fencing: fix due to new version increment Topology version is now bumped when a node finishes bootstrapping. As a result, fence_version == version - 1, and decrementing version in the test no longer triggers a stale topology exception. Fix: run cleanup_all to invoke the global barrier, which synchronizes fence_version := version on all nodes.	2025-10-22 16:31:43 +02:00
Petr Gusev	5bdeb4ec66	test_automatic_cleanup: clean it up Remove redundant imports and variables. Extract cleanup_all function. Add logs. Remove pytest.mark.prepare_3_racks_cluster -- the test doesn't actually need a 3 node cluster, one initial node is enough.	2025-10-22 16:31:43 +02:00
Petr Gusev	f34126aacf	storage_proxy: wait for closing sessions in sstable cleanup fiber Ensure that no stale streaming or repair sessions are active before proceeding with the cleanup.	2025-10-22 16:31:43 +02:00
Petr Gusev	7e2959a1bf	storage_proxy: rename await_pending_writes -> await_stale_pending_writes	2025-10-22 16:31:43 +02:00
Petr Gusev	1dd05f4404	storage_proxy: use run_fenceable_write Switch local write code sites from start_write() to run_fenceable_write().	2025-10-22 16:31:43 +02:00
Petr Gusev	d56495fd9c	storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check All mutation_holder::apply_locally() implementations now do the same post fence check. In this commit we hoist this check up to abstract_write_response_handler::apply_locally().	2025-10-22 16:31:43 +02:00
Petr Gusev	24f8962938	storage_proxy: introduce run_fenceable_write This function is intended to replace start_write() in subsequent commits. It provides the following benefits: * Remove duplication: All start_write() call sites must run the fence check after the operation completes. run_fenceable_write() encapsulates this pattern. * Fix a race: To ensure no new stale write operations occur during cleanup, a fence check before start_write() was previously used. However, yields in several code paths between the check and start_write() made it non-atomic, allowing a stale operation to slip in if the fence_version was updated in between. * Optimize waiting: We do not need to wait for all operations—only for vnode-based, non-local tables with versions smaller than the current fence_version.	2025-10-22 16:31:43 +02:00
Petr Gusev	c5f447224a	storage_proxy: move update_fence_version from shared_token_metadata Future commits will extend update_fence_version, and it is simpler to do so if the function resides in storage_proxy. Additionally, fence_version is the only field this function accesses, and it is used solely within storage_proxy, making this change natural on its own.	2025-10-22 16:31:43 +02:00
Petr Gusev	659c5912e0	storage_proxy: fix start_write() operation scope in apply_locally The operation must be held during the local write. Before this commit, its scope ended after returning from apply_locally(), so it did not actually provide any protection.	2025-10-22 16:31:43 +02:00
Petr Gusev	27915befac	storage_proxy: move post fence check into handle_write handle_write() is invoked from receive_mutation_handler() and handle_paxos_learn(), and both previously performed a fence check in apply_fn. This commit hoists the fence check into handle_write() to reduce code duplication. Additionally, move start_write() after get_schema_for_write(), since there is no need to hold the operation while querying the schema.	2025-10-22 16:31:43 +02:00
Petr Gusev	41077138bf	storage_proxy: move fencing into mutate_counter_on_leader_and_replicate As noted in the code comments, start_write() does not need to be held during counter replication; it is required only while performing local storage modifications. Move the start_write() call and the fence check down to mutate_counter_on_leader_and_replicate(). Additionally, mutate_counters_on_leader() is updated to check for possible stale_topology_exception() and properly package them in the resulting exception_variant structure.	2025-10-22 16:31:43 +02:00
Petr Gusev	a6208b2d67	storage_proxy::handle_read: add fence check before get_schema Avoid querying the schema for outdated requests by adding a fence check at the start of handle_read.	2025-10-22 16:31:43 +02:00
Petr Gusev	263cbef68e	storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber The function applies only to vnode-based tables. Rename it to vnodes_cleanup_fiber for clarity.	2025-10-22 16:31:42 +02:00
Petr Gusev	03aa856da3	sstable_cleanup_fiber: use coroutine::parallel_for_each A refactoring commit -- no need to allocate a dedicated std::vector<future<>>.	2025-10-22 16:31:42 +02:00
Petr Gusev	4a781b67b5	storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock The flush_all_tables() call ensures that no obsolete, cleanup-eligible writes remain in the commitlog. This does not need to run under the group0 lock, so move it outside. Also, run await_pending_writes() before flush_all_tables(), since pending writes may include data that must be cleaned up. Finally, add more detailed info-level logs to trace the stages of the cleanup procedure.	2025-10-22 16:31:42 +02:00
Petr Gusev	a54ebe890b	topology_coordinator: barrier before cleanup Cleanup needs a barrier to make sure that no request coordinators are sending requests to old replicas/ranges that we're going to clean up. For example, during node bootstrap, the cleanup process on replicas must be protected against coordinators running write_both_read_new and sending requests to old ranges. We run a barrier to ensure that most data-plane requests with the old topology finish before cleanup starts. At the same time, we do not want to block cleanup if the barrier fails on some replicas. Once the fence is committed to group0, we can safely proceed, since any late request with the old topology will be fenced out on the replica. The test for this case is added in a separate commit "test_automatic_cleanup: add test_cleanup_waits_for_stale_writes"	2025-10-22 16:31:42 +02:00
Petr Gusev	1b791dacde	topology_coordinator: small start_cleanup refactoring Rename start_cleanup -> start_vnodes_cleanup for clarity. Pass topology_request and server_id in start_vnodes_cleanup, we will need them for better logging later.	2025-10-22 16:31:42 +02:00
Petr Gusev	d53e24812f	global_token_metadata_barrier: add fenced flag Cleanup needs a barrier. For example, during node bootstrap, the cleanup process on replicas must be protected against coordinators running write_both_read_new and sending requests to old ranges. We run a barrier to ensure that most data-plane requests with the old topology finish before cleanup starts. At the same time, we do not want to block cleanup if the barrier fails on some replicas. Once the fence is committed to group0, we can safely proceed, since any late request with the old topology will be fenced out on the replica. To support this, introduce a "fenced" flag. The client can pass a pointer to a bool, which will be set to true after the new fenced_version is committed.	2025-10-22 16:31:42 +02:00
Calle Wilund	c4427f6d4f	test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper Use the shared logic of DockerizedServer to provide the fake-gcs-server docker helper.	2025-10-22 14:06:30 +00:00
Calle Wilund	1aa8014f8f	test::cluster::test_encryption: Port dtest EAR tests Moves the tests, reexamined and simplified, to unit tests instead of dtest.	2025-10-22 14:06:30 +00:00
Asias He	5f1febf545	repair: Remove the regular mode name in the tablet repair api The patch `e34deb72f9` (repair: Rename incremental mode name) missed one place that references the removed regular mode name. Fixes #26503 Closes scylladb/scylladb#26660	2025-10-22 16:55:55 +03:00
Botond Dénes	1c7f1f16c8	Merge 'raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Patryk Jędrzejczak Group0 tombstone GC considers only the current group 0 members while computing the group 0 tombstone GC time. It's not enough because in the Raft-based recovery procedure, there can be nodes that haven't joined the current group 0 yet, but they have belonged to a different group 0 and thus have a non-empty group 0 state ID. The current code can cause a data resurrection in group 0 tables. We fix this issue in this PR and add a regression test. This issue was uncovered by `test_raft_recovery_entry_loss`, which became flaky recently. We skipped this test for now. We will unskip it in a following PR because it's skipped only on master, while we want to backport this PR. Fixes #26534 This PR contains an important bugfix, so we should backport it to all branches with the Raft-based recovery procedure (2025.2 and newer). Closes scylladb/scylladb#26612 * github.com:scylladb/scylladb: test: test group0 tombstone GC in the Raft-based recovery procedure group0_state_id_handler: remove unused group0_server_accessor group0_state_id_handler: consider state IDs of all non-ignored topology members	2025-10-22 16:40:11 +03:00
Ernest Zaslavsky	a09ec56e3d	cmake: fix `s3_test` linkage Fix missing `s3_test` executable linkage with `scylla_encryption` Closes scylladb/scylladb#26655	2025-10-22 14:14:43 +03:00
Anna Stuchlik	9c0ff7c46b	doc: add support for Debian 12 Fixes https://github.com/scylladb/scylladb/issues/26640 Closes scylladb/scylladb#26668	2025-10-22 14:09:13 +03:00
Calle Wilund	93e335f861	test::cluster::conftest: Add key_provider fixture Iterates test functions across all mockable providers and provides a key provider instance handling EAR setup.	2025-10-22 10:53:02 +00:00
Calle Wilund	6406879092	test::pylib::encryption_provider: Port dtest encryption provider classes Adds virtual interface for running scylla with EAR and various providers we can do mock for. Note: GCP KMS not implemented.	2025-10-22 10:53:02 +00:00
Petr Gusev	03d6829783	test_tablets_lwt: add test_tablets_merge_waits_for_lwt	2025-10-22 11:33:20 +02:00
Petr Gusev	33e9ea4a0f	test.py: add universalasync_typed_wrap The universalasync.wrap function doesn't preserve the type information, which confuses the VS Code Pylance plugin and makes code navigation hard. In this commit we fix the problem by adding a typed wrapper around universalasync.wrap. Fixes: scylladb/scylladb#26639	2025-10-22 11:32:37 +02:00
Petr Gusev	b23f2a2425	tablet_metadata_guard: fix split/merge handling The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the tablet_id field (_tablet), which means the guard can no longer correctly protect ongoing operations from tablet migrations. Fixes scylladb/scylladb#26437	2025-10-22 11:32:37 +02:00
Petr Gusev	ec6fba35aa	tablet_metadata_guard: add debug logs	2025-10-22 11:32:37 +02:00
Petr Gusev	64ba427b85	paxos_state: shards_for_writes: improve the error message Add the current token and tablet info, remove 'this_shard_id' since it's always written by the logging infrastructure.	2025-10-22 11:32:37 +02:00
Petr Gusev	6f4558ed4b	storage_service: barrier_and_drain – change log level to info Debugging global barrier issues is difficult without these logs. Since barriers do not occur frequently, increasing the log level should not produce excessive output.	2025-10-22 11:32:37 +02:00
Petr Gusev	e1667afa50	topology_coordinator: fix log message	2025-10-22 11:32:37 +02:00
Nadav Har'El	895d89a1b7	Update seastar submodule Among other things, the merge includes the patch "http: add "Connection: close" header to final server response.". This Fixes #26298: A missing response header meant that a test's client code sometimes didn't notice that the server closed the connection (since the client didn't need to use the connection again), which made one test flaky. * seastar bd74b3fa...63900e03 (6): > Merge 'Rework output_stream::slow_write()' from Pavel Emelyanov output_stream: Fix indentation of the slow_write() method output_stream: Remove pointless else output_stream: Replace std::swap with std::exchange output_stream: Unify some code-paths of slow_write() > Merge 'Deprecate in/out streams move-assignment operator' from Pavel Emelyanov iostream: Deprecate input/output stream default constructor and move-assignment operator test: Sub-split test-cases test: Don't reuse output_stream in file demo test: Keep input_/output_stream as optional util: Construct file_data_source in with_file_input_stream() websocket: Construct in/out in initializer list rpc: Wrap socket and buffers > scripts/perftune.py: detect corrupted NUMA topology information > Merge 'memory, smp: support more than 256 shards' from Avi Kivity reactor, smp: allocate smp queues across all shards memory: increase maximum shard count memory: make cpu_id_shift and related mask dynamic resource, memory: move memory limit calculation to memory.cc resource: don't error if --overprovisioned and asking for more vcpus than available > Merge 'Update perf_test text output, make columns selectable' from Travis Downs perf_tests: enhance text output perf_test_tests: add some check_output tests	2025-10-22 11:26:40 +03:00
Nadav Har'El	7c9f5ef59e	Merge 'alternator/executor: instantly mark view as built when creating it with base table' from Michał Jadwiszczak `CreateTable` request creates GSI/LSI together with the base table, the base table is empty and we don't need to actually build the view. In tablet-based keyspaces we can just don't create view building tasks and mark the view build status as SUCCESS on all nodes. Then, the view building worker on each node will mark the view as built in `system.built_views` (`view_building_worker::update_built_views()`). Vnode-based keyspaces will use the "old" logic of view builder, which will process the view and mark it as built. Fixes scylladb/scylladb#26615 This fix should be backported to 2025.4. Closes scylladb/scylladb#26657 * github.com:scylladb/scylladb: test/alternator/test_tablets: add test for GSI backfill with tablets test/alternator/test_tablets: add reproducer for GSI with tablets alternator/executor: instantly mark view as built when creating it with base table	2025-10-22 10:44:28 +03:00
Calle Wilund	31cc1160b4	test::pylib::dockerized_service: Add helper for running docker/podman While there is a docker interface for python, need to deal with the docker-in-docker issues etc. This uses pure subprocess and stream parse. Meant to provide enough flexibility for all our docker mock server needs.	2025-10-21 23:26:50 +00:00
Avi Kivity	ab488fbb3f	Merge 'Switch to seastar API level 9 (no more packet-s in output_stream/data_sink API)' from Pavel Emelyanov Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone. Using newer seastar API, no need to backport Closes scylladb/scylladb#26592 * github.com:scylladb/scylladb: code: Fix indentation after previous patch code: Switch to seastar API level 9 transport: Open-code invoke_with_counting into counting_data_sink::put transport: Don't use scattered_message utils: Implement memory_data_sink::put(net::packet)	2025-10-22 01:51:43 +03:00
Michał Jadwiszczak	34503f43a1	test/alternator/test_tablets: add test for GSI backfill with tablets The test should pass without the fix for scylladb/scylladb#26615, because the `executor::updata_table()` uses `service::prepare_new_view_announcement()`, which creates view building tasks for the view. But it's better to add this test.	2025-10-22 00:34:49 +02:00
Michał Jadwiszczak	bdab455cbb	test/alternator/test_tablets: add reproducer for GSI with tablets	2025-10-22 00:34:10 +02:00
Andrei Chekun	24d17c3ce5	test.py: rewrite the wait_for_first_completed Rewrite wait_for_first_completed to return only the first completed task, with a guarantee of awaiting (disappearing) all cancelled and finished tasks. Use wait_for_first_completed to avoid falsely passing tests in the future and issues like #26148. Use gather_safely to await tasks and remove the warning that a coroutine was not awaited. Closes scylladb/scylladb#26435	2025-10-22 01:13:43 +03:00
Takuya ASADA	eb30594a60	dist: detect corrupted NUMA topology information Some environments have corrupted NUMA topology information, such as some instance types on AWS EC2 with specific Linux kernel images. In such environments, we cannot get HW information correctly from hwloc, so we cannot proceed with optimization in perftune. To avoid causing a script error, check the NUMA topology information and skip running perftune if the information is corrupted. Related scylladb/seastar#2925 Closes scylladb/scylladb#26344	2025-10-22 01:11:14 +03:00
Michał Jadwiszczak	8fbf122277	alternator/executor: instantly mark view as built when creating it with base table When a `CreateTable` request creates a GSI/LSI together with the base table, the base table is empty and we don't need to actually build the view. In tablet-based keyspaces we can simply skip creating view building tasks and mark the view build status as SUCCESS on all nodes. Then, the view building worker on each node will mark the view as built in `system.built_views` (`view_building_worker::update_built_views()`). Vnode-based keyspaces will use the "old" logic of view builder, which will process the view and mark it as built. Fixes scylladb/scylladb#26615	2025-10-22 00:05:40 +02:00
Avi Kivity	029513bee9	Merge 'storage_proxy: wait for write handlers destruction' from Petr Gusev `shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`. A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`. Fixes scylladb/scylladb#26355 backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets Closes scylladb/scylladb#26408 * github.com:scylladb/scylladb: test_tablets_lwt: add test_lwt_shutdown storage_proxy: wait for write handler destruction storage_proxy: coroutinize cancel_write_handlers storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler	2025-10-22 00:02:08 +03:00
Michał Hudobski	5c957e83cb	vector_search: remove dependence on cql3 This patch removes the dependence of vector search module on the cql3 module by moving the contents of cql3/type_json.hh to types/json_utils.hh and removing the usage of cql3 primary_key object in vector_store_client. We also make the needed adjustments to files that were previously using the aforementioned type_json.hh file. This fixes the circular dependency cql3 <-> vector_search. Closes scylladb/scylladb#26482	2025-10-21 17:41:55 +03:00
Michael Litvak	35711a4400	test: cdc: test cdc compatible schema Add a simple test verifying our changes for the compatible CDC schema. The test checks we can write to a table with CDC enabled after ALTER and after node restart.	2025-10-21 14:14:34 +02:00
Michael Litvak	448e14a3b7	cdc: use compatible cdc schema In the CDC log transformer, when augmenting a base mutation, use the CDC log schema that is compatible with the base schema, if set. Now that the base schema has a pointer to its CDC schema, we can use it instead of getting the current schema from the db, which may not be compatible with the base schema. The compatible CDC schema may not be set if the cluster is not using raft mode for schema. In this case, we maintain the previous behavior.	2025-10-21 14:14:33 +02:00
Michael Litvak	6e2513c4d2	db: schema_applier: create schema with pointer to CDC schema When creating a schema for a non-CDC table in the schema_applier, find its CDC schema that we created previously in the same operation, if any, and create the schema with a pointer to the CDC schema. We use the fact that for a base table with CDC enabled, its CDC schema is created or altered together in the same group0 operation. Similarly, in schema_tables, when creating table schemas from the schema tables, first create all schemas that don't have CDC enabled, then create schemas that have CDC enabled by extending them with the pointer to the CDC schema that we created before. There are few additional cases where we create schemas that we need to consider how to handle. When loading a schema from schema tables in the schema_loader we decide not to set the CDC schema, because this schema is mostly used for tools and it's not used for generating CDC mutations. When transporting a schema by RPC in the migration manager, we don't transport its CDC schema, and we always set it to null. Because we use raft we expect this shouldn't have any effect, because the schema is synchronized through raft and not through the RPC.	2025-10-21 14:13:43 +02:00
Michael Litvak	4fe13c04a9	db: schema_applier: extract cdc tables Previously in the schema applier we have two maps of schema_mutations, for tables and for views. Now create another map for CDC tables by extracting them from the non-views tables map. We maintain the previous behavior by applying each operation that's done on the tables map, to the CDC map as well. Later we will want to handle CDC and non-CDC tables differently. We want to be able to create all CDC schemas first, so when we create the non-CDC tables we can create them with a pointer to their CDC schemas.	2025-10-21 14:13:43 +02:00
Michael Litvak	ac96e40f13	schema: add pointer to CDC schema Add to the schema object a member that points to the CDC schema object that is compatible with this schema, if any. The compatible CDC schema is created and altered with its base schema in the same group0 operation. When generating CDC log mutations for some base mutation we want them to be created using a compatible schema that has a CDC column corresponding to each base column. This change will allow us to find the right CDC schema given a base mutation. We also update the relevant structures in the schema registry that are related to learning about schemas and transporting schemas across shards or nodes. When transporting a schema as frozen_schema, we need to transport the frozen cdc schema as well, and set it again when unfreezing and reconstructing the schema. When adding a schema to the registry, we need to ensure its CDC schema is added to the registry as well. Currently we always set the CDC schema to nullptr and maintain the previous behavior. We will change it in a later commit. Until then, we mark all places where CDC schema is passed clearly so we don't forget it.	2025-10-21 14:13:43 +02:00
Michael Litvak	60f5c93249	schema_registry: remove base_info from global_schema_ptr Remove the _base_info member from global_schema_ptr, and use the base_info we have stored in the schema registry entry instead. Currently when constructing a global_schema_ptr from a schema_ptr it extracts and stores the base_info from the schema_ptr. Later it uses it to reconstruct the schema_ptr, together with the frozen schema from the schema registry entry. But we can use the base_info that is already stored in the schema registry entry.	2025-10-21 14:13:43 +02:00
Michael Litvak	085abef05d	schema_registry: use extended_frozen_schema in schema load Change the schema loader type in the schema_registry to return an extended_frozen_schema instead of view_schema_and_base_info, and remove view_schema_and_base_info which is not used anymore. The casting between them is trivial.	2025-10-21 14:13:43 +02:00
Michael Litvak	8c7c1db14b	schema_registry: replace frozen_schema+base_info with extended_frozen_schema The schema_registry_entry holds a frozen_schema and a base_info. The base_info is extracted from the schema_ptr on load of a schema_ptr, and it is used when unfreezing the schema. But this is exactly what extended_frozen_schema is doing, so we can just store an object of this type in the schema_registry_entry. This makes the code simpler because the schema registry doesn't need to be aware of the base_info.	2025-10-21 14:13:43 +02:00
Michael Litvak	278801b2a6	frozen_schema: extract info from schema_ptr in the constructor Currently we construct a frozen schema with base info in few places, and the caller is responsible for constructing the frozen schema and extracting the base info if it's a view table. We change it to make it simpler and remove the burden from the caller. The caller can simply pass the schema_ptr, and the constructor for extended_frozen_schema will construct the frozen schema and extract the additional info it needs. This will make it easier to add additional fields, and reduces code duplication. We also make temporary castings between extended_frozen_schema and view_schema_and_base_info for the transition, which are trivial, until they are combined to a single type.	2025-10-21 14:13:42 +02:00
Michael Litvak	154d5c40c8	frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema This commit starts a series of refactoring commits of the frozen_schema to reduce duplication and make it easier to extend. Currently there are two essentially identical types, frozen_schema_with_base_info and view_schema_and_base_info in the schema_registry that hold a frozen_schema together with a base_info for view schemas. Their role is to pass around a frozen schema together with additional info that is extracted from the schema and passed around with it when transporting it across shards or nodes, and is needed for reconstructing it, and it is not part of the schema mutations. Our goal is to combine them to a single type that we will call extended_frozen_schema.	2025-10-21 14:13:42 +02:00
Emil Maskovsky	cf93820c0a	test/cluster: fix missing await in test_group0_tombstone_gc The recursive call to alter_system_schema() was missing the await keyword, which meant the coroutine was never actually executed and the test wasn't doing what it was supposed to do. Not backporting: Test fix only. Closes scylladb/scylladb#26623	2025-10-21 11:22:39 +02:00
Calle Wilund	91db8583f8	test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures Add `serve` impl that does not mess with signals, and shutdown that does not mess with threads. Also speed up standalone shutdown to make boost tests less slow.	2025-10-21 09:01:55 +00:00
Calle Wilund	772bd856e2	test::boost::kmip_wrapper: Move python script for PyKMIP to pylib Prepare for re-use in python tests as well as boost ones.	2025-10-21 09:01:54 +00:00
Avi Kivity	0ed178a01e	build: disable the -fextend-variable-liveness clang option In clang 21, the -fextend-variable-liveness option was made default [1] with -Og. It helps reduce "optimized out" problems while debugging. However, it conflicts [2] with coroutines. To prevent problems during the upgrade to Clang 21, disable the option. [1] `36af7345df` [2] https://github.com/llvm/llvm-project/issues/163007 Closes scylladb/scylladb#26573	2025-10-21 10:47:34 +03:00
Botond Dénes	fbceb8c16b	Merge 's3_client: handle failures which require http::request updating' from Ernest Zaslavsky Apply two main changes to the s3_client error handling: 1. Add a loop to s3_client's `make_request` for the case when the retry strategy will not help since the request itself has to be updated, for example, on authentication token expiration or a stale timestamp in the request header. 2. Refine the way we handle exceptions in the `chunked_download_source` background fiber: we now carry the original `exception_ptr` and also wrap EVERY exception in `filler_exception` to prevent the retry strategy from retrying the request altogether. Fixes: https://github.com/scylladb/scylladb/issues/26483 Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions Closes scylladb/scylladb#26527 * github.com:scylladb/scylladb: s3_client: tune logging level s3_client: add logging s3_client: improve exception handling for chunked downloads s3_client: fix indentation s3_client: add max for client level retries s3_client: remove `s3_retry_strategy` s3_client: support high-level request retries s3_client: just reformat `make_request` s3_client: unify `make_request` implementation	2025-10-21 10:40:38 +03:00
Botond Dénes	c543059f86	Merge 'Synchronize tablet split and load-and-stream' from Raphael Raph Carvalho Load-and-stream is broken when running concurrently to the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeed 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets two possible fixes (maybe both): 1) load-and-stream awaits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements # 1. By awaiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance coordinator is handling some topology operation like split finalization. Fixes https://github.com/scylladb/scylladb/issues/26455. Closes scylladb/scylladb#26456 * github.com:scylladb/scylladb: test: Add reproducer for l-a-s and split synchronization issue sstables_loader: Synchronize tablet split and load-and-stream	2025-10-21 09:43:38 +03:00
Tomasz Grabiec	ba692d1805	schema_tables: Keep "replication" column backwards-compatible by expanding rack lists to numeric RF In `380f243986` we added support for rack lists in replication options. Drivers which are not prepared to parse that (as of now, all of them), will not create metadata object for that keyspace. This breaks, for example, the "copy to/from" cqlsh command. Potentially other things too. To fix that, keep the "replication" column in the old format, and store numeric RF there, which corresponds to the number of replicas. Accurate options in the new format are put in "replication_v2". We set replication_v2 in the schema only when it differs from the old "replication" so that the new column is not set during upgrade, otherwise downgrade would fail. Partition tombstone is added to ensure that pre-alter replication_v2 value is deleted on alters which change replication to a value which is the same as the post-alter "replication" value. Fixes #26415 Closes scylladb/scylladb#26429	2025-10-21 09:11:25 +03:00
Tomasz Grabiec	e4e79be295	Merge 'tablet_allocator: allow merges in base tables if rf-rack-valid=true' from Piotr Dulikowski Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case. Fixes: scylladb/scylladb#26273 Marked for backport to 2025.4 as MVs are getting un-experimentaled there. Closes scylladb/scylladb#26278 * github.com:scylladb/scylladb: test: mv: add a test for tablet merge tablet_allocator, tests: remove allow_tablet_merge_with_views injection tablet_allocator: allow merges in base tables if rf-rack-valid=true	2025-10-21 00:18:30 +02:00
Raphael S. Carvalho	4654cdc6fd	test: Add reproducer for l-a-s and split synchronization issue Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-10-20 19:17:25 -03:00
Raphael S. Carvalho	3abc66da5a	sstables_loader: Synchronize tablet split and load-and-stream Load-and-stream is broken when running concurrently to the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeed 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets two possible fixes (maybe both): 1) load-and-stream awaits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements #1. By awaiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance coordinator is handling some topology operation like split finalization. Fixes #26455. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-10-20 19:17:22 -03:00
Piotr Dulikowski	f76917956c	view_building_worker: access tablet map through erm on sstable discovery Currently, the data returned by `database::get_tables_metadata()` and `database::get_token_metadata()` may not be consistent. Specifically, the tables metadata may contain some tablet-based tables before their tablet maps appear in the token metadata. This is going to be fixed after issue scylladb/scylladb#24414 is closed, but for the time being work around it by accessing the token metadata via `table`->effective_replication_map() - that token metadata is guaranteed to have the tablet map of the `table`. Fixes: scylladb/scylladb#26403 Closes scylladb/scylladb#26588	2025-10-21 00:14:39 +02:00
Petr Gusev	8925f31596	test_tablets_lwt: add test_lwt_shutdown	2025-10-20 20:16:09 +02:00
Petr Gusev	bbcf3f6eff	storage_proxy: wait for write handler destruction shared_ptr<abstract_write_response_handler> instances are captured in the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result, an abstract_write_response_handler object may outlive its removal from the _response_handlers map. We use write_handler_destroy_promise to wait for such pending instances in cancel_write_handlers() and cancel_all_write_response_handlers() to prevent use-after-free. A better long-term solution might be to replace shared_ptr with unique_ptr for abstract_write_response_handler and use a separate gate to track the lmutate/rmutate lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in ~abstract_write_response_handler. Fixes scylladb/scylladb#26355	2025-10-20 20:10:42 +02:00
Petr Gusev	b269f78fa6	storage_proxy: coroutinize cancel_write_handlers The cancel_write_handlers() method was assumed to be called in a thread context, likely because it was first used from gossiper events, where a thread context already existed. Later, this method was reused in abort_view_writes() and abort_batch_writes(), where threads are created on the fly and appear redundant. The drain_on_shutdown() method also used a thread, justified by some "delicate lifetime issues", but it is unclear what that actually means. It seems that a straightforward co_await should work just fine.	2025-10-20 19:49:02 +02:00
Petr Gusev	bf2ac7ee8b	storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler A strong pointer was held for the duration of thread::yield(), preventing abstract_write_response_handler destruction and possibly delaying the sending of timeout or error responses to the client. This commit removes the strong pointer. Instead, we compute the next iterator before calling timeout_cb(), so if the handler is destroyed inside timeout_cb(), we already have a valid next iterator.	2025-10-20 19:49:02 +02:00
Piotr Wieczorek	a3ec6c7d1d	alternator/streams: Support userIdentity field for TTL deletions UserIdentity is a map of two fields in GetRecords responses, which always has the same value. It may be missing, or contain a constant object with value `{"type": "Service", "principalId": "dynamodb.amazonaws.com"}`. Currently, the latter is set only for `REMOVE`s triggered by TTL. This commit introduces two new CDC operation types: `service_row_delete` and `service_partition_delete`, emitted in place of `row_delete` and `partition_delete`. Alternator Streams treats them as regular `REMOVE`s, but in addition adds the `userIdentity` field to the record. This change may break existing Scylla libraries for reading raw CDC tables, but we doubt that anybody has this use case. Refs https://github.com/scylladb/scylladb/pull/26149 Refs https://github.com/scylladb/scylladb/pull/26121 Fixes https://github.com/scylladb/scylladb/issues/11523 Closes scylladb/scylladb#26460	2025-10-20 17:15:59 +02:00
Nadav Har'El	eb06ace944	Merge 'auth: implement vector store authorization' from Michał Hudobski This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is: - adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES - allowing users with that permission to perform SELECT queries, but only on tables with a vector index - increasing the number of scheduling groups by one to allow users to create a service level for a vector store user - adjusting the tests and documentation These changes are needed, as the vector indexes are managed by the external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum to maintain the principle of least privilege, therefore a new permission, one that allows the SELECTs conditional on the existence of a vector_index on the table. Fixes: VECTOR-201 Backport reasoning: Backport to 2025.4 required as this can make upgrading clusters more difficult if we add it in 2026.1. As for now Scylla Cloud requires version 2025.4 to enable vector search and permission is set by orchestrator so there is no chance that someone will try to add this permission during upgrade. In 2026.1 it will be more difficult. Closes scylladb/scylladb#25976 * github.com:scylladb/scylladb: docs: adjust docs for VS auth changes test: add tests for VECTOR_SEARCH_INDEXING permission cql: allow VECTOR_SEARCH_INDEXING users to select auth: add possibilty to check for any permission in set auth: add a new permission VECTOR_SEARCH_INDEXING	2025-10-20 17:32:00 +03:00
Ernest Zaslavsky	fdd0d66f6e	s3_client: tune logging level Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs.	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	4497325cd6	s3_client: add logging Add logging for the case when we encounter expired credentials. This shouldn't happen, but just in case.	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	1d34657b14	s3_client: improve exception handling for chunked downloads Refactor the wrapping exception used in `chunked_download_source` to prevent the retry strategy from reattempting failed requests. The new implementation preserves the original `exception_ptr`, making the root cause clearer and easier to diagnose.	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	58a1cff3db	s3_client: fix indentation Reformat `client::make_request` to fix the indentation of `if` block	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	43acc0d9b9	s3_client: add max for client level retries To prevent the client from retrying indefinitely on time skew and authentication errors, add `max_attempts` to `client::make_request`	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	116823a6bc	s3_client: remove `s3_retry_strategy` It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	185d5cd0c6	s3_client: support high-level request retries Add an option to retry S3 requests at the highest level, including reinitializing headers and reauthenticating. This addresses cases where retrying the same request fails, such as when the S3 server rejects a timestamp older than 15 minutes.	2025-10-20 17:12:59 +03:00
Dawid Mędrek	7e201eea1a	index: Set tombstone_gc when creating secondary index Before this commit, when the underlying materialized view was created, it didn't have the property `tombstone_gc` set to any value. That was a bug and we fix it now. Two reproducer tests are added for validation. They reproduce the problem and don't pass before this commit. Fixes scylladb/scylladb#26542	2025-10-20 14:04:45 +02:00
Dawid Mędrek	e294b80615	index: Make `create_view_for_index` method of `create_index_statement`	2025-10-20 14:04:16 +02:00
Dawid Mędrek	fe00485491	index: Move code for creating MV of secondary index to cql3 We move the code responsible for creating the schema for the underlying materialized view of a secondary index from `index/` to `cql3/` so that it's close to that responsible for performing `CREATE INDEX`. That's in line with how other CQL statements are designed. Note that the moved method is still a method of `secondary_index_manager`. We'll make it a method of `create_index_statement` in the following commit.	2025-10-20 14:04:11 +02:00
Dawid Mędrek	20761b5f13	db, cql3: Move creation of underlying MV for index The main goal of this patch is to give more control over the creation of the underlying view on an index to `create_index_statement.cc`. That goal is in line with how the other statements are executed: the schema is built in the cql3 module and only the ready schema_ptr is passed further. That should also make the code cleaner and easier to understand. There are a few important things to note here: * A call to `service::prepare_new_view_announcement` appears out of nowhere. Aside from some validation checks and logging, that function does pretty much the same as the pre-existing code we remove: a. It creates Raft mutations based on the passed `view_ptr`. b. It creates Raft mutations responsible for view building tasks. c. It notifies about a new column family. * We seemingly get rid of the code that creates view building tasks. That's not true: we still do that via `service::prepare_new_view_announcement`. That should explain why the change doesn't remove any relevant logic. On the other hand, it might be more difficult to explain why moving the code is correct. I'll touch on it below. Before that, it may also be important to highlight that this commit only affects the logic responsible for creating an index. There should be no effect on any other part of how Scylla behaves. --- Proving the correctness of the solution would take quite a lot of space, so I'll only summarize it. It relies on a few things: 1. Two schema changes cannot happen in one operation. We allow for more but only when those changes are dependent on each other and when the additional ones are internal for Scylla, e.g. creating an index leads to creating the underlying materialized view. 2. There are no entities or components that rely on indexes. 3. Each index is uniquely defined by the keyspace it belongs to and the name of the index. 4. 
There is a bijection between rows in `system_schema.indexes` and the currently existing indexes. 5. The name of an unnamed index depends on the name of the base table and the names of the indexed columns. The name of an unnamed index may have a number attached to it, but that number only depends on the state of the schema at the time of creation of the index, and it never changes later on. There are no other things the name of an unnamed index depends on. 6. Scylla doesn't allow for changing any column in the base table that has an index depending on it. Based on that, we conclude that every existing index has exactly one entry in `system_schema.indexes`, and the primary key of that entry never changes. The columns of `system_schema.indexes` that are not part of the primary key are: `kind` and `options`. Both values are only decided at the time of creation of an index, and currently there's no way to modify them. That implies that there are only two events when an entry in the system table can change: when creating an index and when dropping an index. --- When we consider the previous place of the logic that this commit moves to `cql3/statements/create_index_statement.cc`, it works like this: 1. We compare the sets of indexes defined on a specific table (in the form of a structure called `index_metadata`) before and after an operation. 2. We divide the entries into three sets: those present in both sets and those present in only one of them. 3. We handle each of those three sets separately. The structure `index_metadata` is a reflection of entries in `system_schema.indexes`. It stores one more parameter -- `local` -- but its value depends on the other values of an entry, so we can ignore it in this reasoning. Because an index cannot be modified -- it can only be created or dropped -- there are at most two non-empty sets: the set of new indexes and the set of dropped indexes. 
Those sets are only non-empty during an operation like `CREATE INDEX`, `DROP INDEX`, `DROP TABLE (base table)`, `DROP KEYSPACE`. Note that it's impossible to drop an index by dropping the underlying materialized view -- Scylla doesn't allow for that. However, the code in `migration_manager.cc` we call (`prepare_column_family_update_announcement`) and the code that we call in `schema_tables.cc` (`make_update_table_mutations`) is only triggered by updates related to the base table. In the context of `DROP TABLE` or `DROP KEYSPACE`, we'd call `prepare_column_family_drop_announcement` instead. In other words, we're only concerned with `CREATE INDEX` and `DROP INDEX`. --- A conclusion from this reasoning is that we only need to consider those two situations when talking about correctness of this change. The impact of this commit is that we may have potentially reordered mutations in the resulting vector that will be applied to the Raft log. The only mutations we may have reordered are the mutations responsible for creating the underlying view and the mutations responsible for updating columns in the base table. It's clear then that this commit brings no change at all: we only give `cql3/statements/create_index_statement.cc` more control over creating the underlying view. --- We leave a remnant of the code in `db/schema_tables.cc` responsible for dropping an index along with its underlying view. It would require changing a bit more of the logic, and we don't need it for the rest of this sequence of changes. Refs scylladb/scylladb#16454	2025-10-20 14:04:06 +02:00
Łukasz Paszkowski	7ec369b900	database: Log message after critical_disk_utilization mode is set This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030 The test test_user_writes_rejection starts a 3-node cluster and creates a large file on one of the nodes, to trigger the out-of-space prevention mechanism, which should reject writes on that node. It waits for the log message 'Setting critical disk utilization mode: true' and then executes a write expecting the node to reject it. Currently, the message is logged before the `_critical_disk_utilization` variable is actually updated. This causes the test to fail sporadically if it runs quickly enough. The fix splits the logging into two steps: 1. "Asked to set critical disk utilization mode" - logged before any action 2. "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated The tests are updated to wait for the second message. Fixes https://github.com/scylladb/scylladb/issues/26004 Closes scylladb/scylladb#26392	2025-10-20 13:24:10 +03:00
Asias He	33bc1669c4	repair: Fix uuid and nodes_down order in the log Fixes #26536 Closes scylladb/scylladb#26547	2025-10-20 13:21:59 +03:00
Pavel Emelyanov	44ed3bbb7c	Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage. Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers. This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend. Similarly with storage_options. Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc). Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends. Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake. Fixes #25359 Fixes #26453 Closes scylladb/scylladb#26186 * github.com:scylladb/scylladb: docs::dev::object_storage: Add some initial info on GS storage docs/dev: Add mention of (nested) docker usage in testing.md sstables::object_storage_client: Forward memory limit semaphore to GS instance utils::gcp::object_storage: Add optional memory limits to up/download sstables::object_storage_client: Add multi-upload support for GS utils::gcp::storage: Add merge objects operation test_backup/test_basic: Make tests multiplex both s3 and gs backends test::cluster::conftest: Add support for multiple object storage backends boost::gcs_storage_test: reindent boost::gcs_storage_test: Convert to use fixture tests::boost: Add GS object storage cases to mirror S3 ones tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env sstables::object_storage_client: Add google storage implementation test_services: Allow 
testing with GS object storage parameters utils::gcp::gcp_credentials: Add option to create uninitialized credentials utils::gcp::object_storage: Make create_download_source return seekable_data_source utils::gcp::object_storage: Add defensive copies of string_view params utils::gcp::object_storage: Add missing retry backoff increate utils::gcp::object_storage: Add timestamp to object listing utils::gcp::object_storage: Add paging support to list_objects object_storage_client: Add object_name wrapper type utils::gcp::object_storage: Add optional abort_source utils::rest::client: Add abort_source support sstables: Use object_storage_client for remote storage sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial) s3::upload_progress: Promote to general util type storage_options: Abstract s3 to "object_storage" and add gs as option sstables::file_io_extension: Change "creator" callback to just data_source utils::io-wrappers: Add ranged data_source utils::io-wrappers: Add file wrapper type for seekable_source utils::seekable_source: Add a seekable IO source type object_storage_endpoint_param: Add gs storage as option config: break out object_storage_endpoint_param preparing for multi storage	2025-10-20 13:14:53 +03:00
Patryk Jędrzejczak	c57f097630	test: test group0 tombstone GC in the Raft-based recovery procedure We add a regression test for the bug fixed in the previous commits.	2025-10-20 12:05:11 +02:00
Patryk Jędrzejczak	6b2e003994	group0_state_id_handler: remove unused group0_server_accessor It became unused in the previous commit.	2025-10-20 12:05:11 +02:00
Patryk Jędrzejczak	1d09b9c8d0	group0_state_id_handler: consider state IDs of all non-ignored topology members It's not enough to consider only the current group 0 members. In the Raft-based recovery procedure, there can be nodes that haven't joined the current group 0 yet, but they have belonged to a different group 0 and thus have a non-empty group 0 state ID. We fix this issue in this commit by considering topology members instead. We don't consider ignored nodes as an optimization. When some nodes are dead, the group 0 state ID handler won't have to wait until all these nodes leave the cluster. It will only have to wait until all these nodes are ignored, which happens at the beginning of the first removenode/replace. As a result, tombstones of group 0 tables will be purged much sooner. We don't rename the `group0_members` variable to keep the change minimal. There seems to be no precise and succinct name for the used set of nodes anyway. We use `std::ranges::join_view` in one place because: - `std::ranges::concat` will become available in C++26, - `boost::range::join` is not a good option, as there is an ongoing effort to minimize external dependencies in Scylla.	2025-10-20 12:05:07 +02:00
Avi Kivity	87c0adb2fe	gdb: simplify and future-proof looking up coroutine frame type llvm recently updated [1] their coroutine debugging instructions. They now recommend looking up the variable __coro_frame in the coroutine function rather than constructing the name of the coroutine frame type from the ramp function plus __coro_frame_ty. Since the latter method no longer works with Clang 21 (I did not check why), and since the former method is blessed as being more compatible, switch to the recommended method. Since it works with both Clang 20 and Clang 21, it future proofs the script. [1] `6e784afcb5` Closes scylladb/scylladb#26590	2025-10-20 12:38:53 +03:00
Botond Dénes	1ab697693f	Merge 'compaction/twcs: fix use after free issues' from Lakshmi Narayanan Sreethar The `compaction_strategy_state` class holds strategy specific state via a `std::variant` containing different state types. When a compaction strategy performs compaction, it retrieves a reference to its state from the `compaction_strategy_state` object. If the table's compaction strategy is ALTERed while a compaction is in progress, the `compaction_strategy_state` object gets replaced, destroying the old state. This leaves the ongoing compaction holding a dangling reference, resulting in a use after free. Fix this by using `seastar::shared_ptr` for the state variant alternatives (`leveled_compaction_strategy_state_ptr` and `time_window_compaction_strategy_state_ptr`). The compaction strategies now hold a copy of the shared_ptr, ensuring the state remains valid for the duration of the compaction even if the strategy is altered. The `compaction_strategy_state` itself is still passed by reference and only the variant alternatives use shared_ptrs. This allows ongoing compactions to retain ownership of the state independently of the wrapper's lifetime. The method `maybe_wait_for_sstable_count_reduction()`, when retrieving the list of sstables for a possible compaction, holds a reference to the compaction strategy. If the strategy is updated during execution, it can cause a use after free issue. To prevent this, hold a copy of the compaction strategy so it isn’t yanked away during the method’s execution. Fixes #25913 The issue probably started after `9d3755f276`, so backport to 2025.4 Closes scylladb/scylladb#26593 * github.com:scylladb/scylladb: compaction: fix use after free when strategy is altered during compaction compaction/twcs: pass compaction_strategy_state to internal methods compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction	2025-10-20 10:45:47 +03:00
Ernest Zaslavsky	db1ca8d011	s3_client: just reformat `make_request` Just reformat previously changed methods to improve readability	2025-10-20 10:44:37 +03:00
Israel Fruchter	986e8d0052	Update tools/cqlsh submodule (v6.0.27) * tools/cqlsh ff3f572...f852b1f5 (2): > Add LZ4 as a required package - so ScyllaDB Python driver could use LZ4 compression > github actions: replace macos-13 with macos-15-intel Closes scylladb/scylladb#26608	2025-10-20 10:03:31 +03:00
Michael Litvak	b808d84d63	storage_service: improve colocated repair error to show table names When requesting repair for tablets of a colocated table, the request fails with an error. Improve the error message to show the table names instead of table IDs, because the table names are more useful for users. Fixes scylladb/scylladb#26567 Closes scylladb/scylladb#26568	2025-10-20 10:03:31 +03:00
Piotr Dulikowski	70b0cfb13e	Merge 'test: cluster: Replica exceptions tests' from Dario Mirovic This patch series introduces several tests that check the number of exceptions that happen during various replica operations. The goal is to have a set of tests that can catch situations where the number of exceptions per operation increases. It makes exception throw regressions easier to catch. The tests cover apply counter update and apply functionalities in the database layer. There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths. Fixes #18164 This is an improvement. No backport needed. Closes scylladb/scylladb#25992 * github.com:scylladb/scylladb: test: cluster: test replica write timeout database: parameterize apply_counter_update_delay_5s injector value test: cluster: test replica exceptions - test rate limit exceptions	2025-10-20 10:03:31 +03:00
Piotr Dulikowski	a716fab125	Merge 'alternator/metrics: Log operation sizes to histograms' from Piotr Wieczorek This PR adds operation per-table histograms to Alternator with item sizes involved in an operation, for each of the operations: `GetItem`, `PutItem`, `DeleteItem`, `UpdateItem`, `BatchGetItem`, `BatchWriteItem`. If read-before-write wasn't performed (i.e. it was not needed by the operation and the flag `alternator_force_read_before_write` was disabled), then we log sizes of the items that are in the request. Also, `UpdateItem` logs the maximum of the update size and the existing item size. We'll change it in a next PR. Fixes: #25143 Closes scylladb/scylladb#25529 * github.com:scylladb/scylladb: alternator: Add UpdateItem and BatchWriteItem response size metrics alternator: Add PutItem and DeleteItem response size metrics alternator: Add BatchGetItem response size metrics alternator: Add GetItem response size metrics alternator/test: Add more context to test_metrics.py asserts	2025-10-20 10:03:31 +03:00
Lakshmi Narayanan Sreethar	18c071c94b	compaction: fix use after free when strategy is altered during compaction The `compaction_strategy_state` class holds strategy specific state via a `std::variant` containing different state types. When a compaction strategy performs compaction, it retrieves a reference to its state from the `compaction_strategy_state` object. If the table's compaction strategy is ALTERed while a compaction is in progress, the `compaction_strategy_state` object gets replaced, destroying the old state. This leaves the ongoing compaction holding a dangling reference, resulting in a use after free. Fix this by using `seastar::shared_ptr` for the state variant alternatives (`leveled_compaction_strategy_state_ptr` and `time_window_compaction_strategy_state_ptr`). The compaction strategies now hold a copy of the shared_ptr, ensuring the state remains valid for the duration of the compaction even if the strategy is altered. The `compaction_strategy_state` itself is still passed by reference and only the variant alternatives use shared_ptrs. This allows ongoing compactions to retain ownership of the state independently of the wrapper's lifetime. Fixes #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 22:57:05 +05:30
Lakshmi Narayanan Sreethar	35159e5b02	compaction/twcs: pass compaction_strategy_state to internal methods During TWCS compaction, multiple methods independently fetch the compaction_strategy_state using get_state(). This can lead to inconsistencies if the compaction strategy is ALTERed while the compaction is in progress. This patch fixes a part of this issue by passing down the state to the lower level methods as parameters instead of fetching it repeatedly. Refs #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 21:26:30 +05:30
Lakshmi Narayanan Sreethar	1cd43bce0e	compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction The method `maybe_wait_for_sstable_count_reduction()`, when retrieving the list of sstables for a possible compaction, holds a reference to the compaction strategy. If the strategy is updated during execution, it can cause a use after free issue. To prevent this, hold a copy of the compaction strategy so it isn’t yanked away during the method’s execution. Refs #26546 Refs #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 21:26:30 +05:30
Dario Mirovic	1d93f342f9	test: cluster: test replica write timeout This patch introduces test `test_replica_database_apply_timeout`. It tests timeout on database write. The test uses error injection that returns timeout error if the injection `database_apply_force_timeout` is enabled. Refs #18164	2025-10-17 11:52:11 +02:00
Dario Mirovic	ff88fe2d76	database: parameterize apply_counter_update_delay_5s injector value Parameterize `apply_counter_update_delay_5s` injector value. Instead of sleeping 5s when the injection is active, read parameter value that specifies sleep duration. To reflect these changes, it is renamed to `apply_counter_update_delay_ms` and the sleep duration is specified in milliseconds. Refs #18164	2025-10-17 11:52:10 +02:00
Dario Mirovic	7dc0ff2152	test: cluster: test replica exceptions - test rate limit exceptions This patch introduces two tests for `replica::rate_limit_exception`. One test is for write/apply limit, the other one for read/query limit. The tests check the number of rate limit errors reported and the number of cpp exceptions reported. If somebody adds an exception throw on the rate limit paths, this test will catch it and fail. Refs #18164	2025-10-17 10:54:43 +02:00
Botond Dénes	01bcafbe24	Merge 'test: make various improvements in the recovery procedure tests' from Patryk Jędrzejczak This PR contains various improvements in the recovery procedure tests, mostly `test_raft_recovery_user_data`: - decreasing the running time, - some simplifications, - making sure group 0 majority is lost when expected. These are not critical test changes, so no need to backport. Closes scylladb/scylladb#26442 * github.com:scylladb/scylladb: test: assert that majority is lost in some tests of the recovery procedure test: rest_client: add timeout support for read_barrier test: test_raft_recovery_user_data: lose majority when killing one dc test: test_raft_recovery_user_data: shutdown driver sessions test: test_raft_recovery_user_data: use a separate driver connection for the write workload test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s test: test_raft_recovery_user_data: speed up replace operations test: stop/start servers concurrently in the recovery procedure tests	2025-10-17 10:54:05 +03:00
Piotr Wieczorek	a2b9d7eed5	alternator: Split `update_item_operation::apply` into smaller methods This is a minor refactoring aimed at reducing cognitive complexity of `update_item_operation::apply`. The logic remains unchanged. Closes scylladb/scylladb#25887	2025-10-17 09:51:05 +02:00
Taras Veretilnyk	d9be2ea69b	docs: improve nodetool getendpoints documentation Clarified and expanded the documentation for the nodetool getendpoints command, including detailed explanations of the --key and --key-components options. Added examples demonstrating usage with simple and composite partition keys. Closes scylladb/scylladb#26529	2025-10-17 10:40:54 +03:00
Pawel Pery	10208c83ca	vector_search: fix flaky dns_refresh_aborted test The test proceeds as follows: - run a long dns refresh process - request resolution of the hostname with a short abort_source timer - the result should be an empty list, because the request was aborted The test sometimes finishes the long dns refresh before the abort_source fires, and the result list is not empty. There are two issues. First, as.reset() changes the abort_source timeout. The patch adds a get() method to the abort_source_timeout class, so there is no change in the abort_source timeout. Second, a sleep is not reliable. The patch changes the long sleep inside a dns refresh lambda into condition_variable handling, to properly signal the end of the dns refresh process. Fixes: #26561 Fixes: VECTOR-268 It needs to be backported to 2025.4 Closes scylladb/scylladb#26566	2025-10-17 09:33:17 +02:00
Pavel Emelyanov	7d0722ba5c	code: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:50 +03:00
Pavel Emelyanov	a88a36f5b5	code: Switch to seastar API level 9 In the new API the biggest change is to implement the only data_sink_impl::put(span<temporary_buffer>) overload. Encrypted file impl and sstables compress sink use fallback_put() helper that generates a chain of continuations each holding a buffer. The counting_data_sink in transport had mostly been patched to correct implementation by the previous patch, the change here is to replace vector argument with span one. Most other sinks just re-implement their put(vector<temporary_buffer>) overload by iterating over span and non-preemptively grabbing buffers from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:50 +03:00
Pavel Emelyanov	9ece535b5e	transport: Open-code invoke_with_counting into counting_data_sink::put The former helper is implemented like this: future<> invoke_with_counting(fn) { if (not_needed) return fn(); return futurize_invoke(something).then([fn] { return fn() }).finally(something_else); } and all put() overloads are like future<> put(arg) { return invoke_with_counting([this, arg] { return lower_sink.put(arg); }); } The problem is that with seastar API level 9, the put() overload will have to move the passed buffers into stable storage before preempting. In its current implementation, when counting is needed the invoke_with_counting will link lower_sink.put() invocation to the futurize_invoke(something) future. Although "something" is non-preempting, and futurize_invoke() on it returns a ready future, in debug mode ready_future.then() does preempt, and the API level 9 put() contract will be violated. To facilitate the switch to the new API level, this patch rewrites one of the put() overloads to look like future<> put(arg) { if (not_needed) { return lower_sink.put(arg); } something; return lower_sink.put(arg).finally(something_else); } Other put()-s will be removed by the next patch anyway, but this put() will be patched and will call lower_sink.put() without preemption. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:30 +03:00
Pavel Emelyanov	068d788084	transport: Don't use scattered_message The API to put scattered_message into output_stream() is gone in seastar API level 9, and transport is the only place in Scylla that still uses it. The change is to put the response as a sequence of temporary_buffer-s. This preserves the zero-copy-ness of the reply, but a few things need care. First, the response header frame needs to be put as a zero-copy buffer too. Although output_stream() supports a semi-mixed mode, where z.c. buffers can follow the buffered writes, it won't apply here. The socket is flushed in batched mode, so even if the first reply populates the stream with data and flushes it, the next response may happen to start putting the header frame before the delayed flush took place. Second, because the socket is flushed in the batch-flush poller, the temporary buffers that are put into it must hold the foreign_ptr with the response object. With scattered message this was implemented with the help of a deleter that was attached to the message; now the deleter is shared between all buffers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:17:08 +03:00
Pavel Emelyanov	d9808fafdb	utils: Implement memory_data_sink::put(net::packet) It's going to be removed by next-after-next patch, but the next one needs this overload implemented properly, so here it is. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:17:08 +03:00
Tomasz Grabiec	c4a87453a2	Merge 'Add experimental feature flag for strongly consistent tables and extend kesypace creation syntax to allow specifying consistency mode.' from Gleb Natapov The series adds an experimental flag for strongly consistent tables and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace. Closes scylladb/scylladb#26116 * github.com:scylladb/scylladb: schema: Allow configuring consistency setting for a keyspace db: experimental consistent-tablets option	2025-10-16 21:48:06 +02:00
Tomasz Grabiec	e6c427953e	Merge 'schema_applier: unify handling of token_metadata during schema change' from Marcin Maliszkiewicz This patchset improves the atomicity and clarity of schema application in the presence of token metadata updates during schema changes. The primary focus is to ensure that changes to tablet metadata are applied atomically as part of the schema commit phase, rather than being replicated to all cores afterward, which previously violated atomicity guarantees. Key changes: - Introduced pending_token_metadata to unify handling of new and existing metadata. - Split token metadata replication into prepare and commit steps. - Abstracted schema dependencies in storage_service to support pending schema visibility. - Applied tablet metadata updates atomically within schema commit phase. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/24414 Closes scylladb/scylladb#25302 * github.com:scylladb/scylladb: db: schema_applier: update tablet metadata atomically db: replica: move tables_metadata locking to commit storage_service: abstract schema dependecies during token metadata update storage_service: split replicate_to_all_cores to steps db: schema_applier: unify token_metadata loading replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge service: fix dependencies during migration_manager startup db: schema_applier: move pending_token_metadata to locator db: always use _tablet_hint as condition for tablet metadata change db: refactor new_token_metadata into pending_token_metadata db: rename new_token_metadata to pending_token_metadata db: schema_applier: move types storage init to merge_types func db: schema_applier: make merge functions non-static members db: remove unused proxy from create_keyspace_metadata	2025-10-16 21:43:49 +02:00
Piotr Wieczorek	caa522a29d	alternator: Add UpdateItem and BatchWriteItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. The new metrics are: - `operation_size_kib op=UpdateItem`: Tracks the size of an `UpdateItem` operation. This is calculated as the sum of the existing item's size plus the estimated size of the updated fields. - `operation_size_kib op=BatchWriteItem`: Tracks the total size of items within a `BatchWriteItem` request, aggregated on a per-table basis. If an item already exists, the logged size is the maximum of the old and the new item size. NOTE: Both metrics rely on read-before-write, so if the `alternator_force_read_before_write` option is disabled, these metrics may be incomplete and report inaccurate sizes.	2025-10-16 19:17:27 +02:00
Piotr Wieczorek	5ca42b3baf	alternator: Add PutItem and DeleteItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. Specifically, this commit adds `operation_size_kb` histograms for sizes of items created or replaced by the `PutItem` operation, and sizes of items deleted by `DeleteItem` requests. The latter needs a read-before-write, so the metrics may be incomplete if `alternator_force_read_before_write` is disabled.	2025-10-16 19:17:26 +02:00
Piotr Wieczorek	5c72fd9ea3	alternator: Add BatchGetItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. Specifically, this commit adds an `operation_size_kb` per-table histogram, which contains item sizes in BatchGetItem requests. The size of a BatchGetItem is the sum of the sizes of all items in the operation grouped by table. In other words, a single BatchGetItem, and BatchWriteItem for that matter, updates the histograms for each table that it has items in.	2025-10-16 19:16:57 +02:00
Piotr Wieczorek	1aa3819b57	alternator: Add GetItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. Specifically, this commit adds a per-table `operation_size_kb` histogram, recording the sizes of the items contained in GetItem responses.	2025-10-16 19:04:55 +02:00
Piotr Dulikowski	44257f4961	Merge 'raft topology: disable schema pulls in the Raft-based recovery procedure' from Patryk Jędrzejczak Schema pulls should always be disabled when group 0 is used. However, `migration_manager::disable_schema_pulls()` is never called during a restart with `recovery_leader` set in the Raft-based recovery procedure, which causes schema pulls to be re-enabled on all live nodes (excluding the nodes replacing the dead nodes). Moreover, schema pulls remain enabled on each node until the node is restarted, which could be a very long time. We fix this issue and add a regression test in this PR. Fixes #26569 This is an important bug fix, so it should be backported to all branches with the Raft-based recovery procedure (2025.2 and newer branches). Closes scylladb/scylladb#26572 * github.com:scylladb/scylladb: test: test_raft_recovery_entry_loss: fix the typo in the test case name test: verify that schema pulls are disabled in the Raft-based recovery procedure raft topology: disable schema pulls in the Raft-based recovery procedure	2025-10-16 18:48:38 +02:00
Emil Maskovsky	6769c313c2	raft: small fixes for voters code Minor cleanups and improvements to voter-related code. No backport: cleanup only, no functional changes. Closes scylladb/scylladb#26559	2025-10-16 18:41:08 +02:00
Ernest Zaslavsky	55fb2223b6	s3_client: unify `make_request` implementation Refactor `make_request` to use a single core implementation that handles authentication and issues the HTTP request. All overloads now delegate to this unified method.	2025-10-16 15:51:28 +03:00
Piotr Wieczorek	1559021c4e	alternator/test: Add more context to test_metrics.py asserts This commit adds more information to the assert messages to ease in debugging. The semantics of the asserts remains the same.	2025-10-16 14:41:19 +02:00
Piotr Dulikowski	a8d92f2abd	test: mv: add a test for tablet merge The test test_mv_tablets_replace verifies that merging tablets of both a view and its base table is allowed if rf-rack-valid-keyspaces option is enabled (and it is enabled by default in the test suite).	2025-10-16 14:07:37 +02:00
Piotr Dulikowski	359ed964e3	tablet_allocator, tests: remove allow_tablet_merge_with_views injection The `allow_tablet_merge_with_views` error injection was previously used to allow merging tablets in a table which has materialized views attached to it. Now, the error injection is not needed because this is allowed under the rf-rack-valid condition, which is enabled by default in tests. Remove the error injection from the code and adjust the tests not to use it.	2025-10-16 14:07:37 +02:00
Piotr Dulikowski	189ad96728	tablet_allocator: allow merges in base tables if rf-rack-valid=true Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case. Fixes: scylladb/scylladb#26273	2025-10-16 13:02:05 +02:00
Gleb Natapov	c255740989	schema: Allow configuring consistency setting for a keyspace We want to add strongly consistent tables as an option. We will have two kinds of strongly consistent tables: globally consistent and locally consistent. The former means that requests from all DCs will be globally linearisable, while the latter means that only requests to the same DC will be linearisable. To allow configuring all the possibilities, the patch adds a new parameter to a keyspace definition, "consistency", that can be configured to be `eventual`, `global` or `local`. A non-eventual setting is supported for tablet-enabled keyspaces only. Since we want to start with implementing local consistency, configuring global consistency will result in an error for now.	2025-10-16 13:34:49 +03:00
Avi Kivity	8f1de2a7ad	Merge 'test/boost: speed up test test_indexing_paging_and_aggregation by making internal page size configurable' from Nadav Har'El The C++ test `test_indexing_paging_and_aggregation` is one of the slowest tests in test/boost. The reason for its slowness is that it needs a table with more rows than SELECT's "DEFAULT_COUNT_PAGE_SIZE" which was hard-coded to 10,000, so the test needed to write and read tens of thousands of rows, and did it multiple times. It turns out the code actually had an ad-hoc mechanism to override DEFAULT_COUNT_PAGE_SIZE in a C++ test, but both this mechanism and the test itself were so opaque that I didn't find it until I had fixed the problem in a different way. What I ended up doing in this pull request is the following (each step in a separate patch): 1. Rewrite this test in Python, in the test/cqlpy framework. This was straightforward, as this test only used CQL and not internal interfaces. The reason why this test wasn't written in Python in the first place is that it was written in 2019, a year before cqlpy existed. I added extensive comments to the new tests, and I finally understood what it was doing :-) 2. I replaced the ad-hoc C++-test-only mechanism of overriding DEFAULT_COUNT_PAGE_SIZE by a bona-fide configuration parameter, `select_internal_page_size`. 3. Finally, the Python test can temporarily lower `select_internal_page_size` and use a table with far fewer rows. After this series, the test `test_indexing_paging_and_aggregation` (which is now in Python instead of C++) takes around half a second, 20 times faster than before. I expect the speedup to be even more dramatic for the debug build. Closes scylladb/scylladb#25368 * github.com:scylladb/scylladb: cql: make SELECT's "internal page size" configurable secondary index: translate test_indexing_paging_and_aggregation to Python	2025-10-16 11:58:13 +03:00
Marcin Maliszkiewicz	47dba4203a	db: schema_applier: update tablet metadata atomically Previously, the mutable_token_metadata_ptr containing tablet changes was replicated to all cores in the post_commit phase, which violated the atomicity guarantee of schema_applier. Now it's incorporated into the per-shard commit phase. It uses the service::schema_getter abstraction introduced in an earlier commit to inject the "pending" schema, which is not yet visible to the whole system.	2025-10-16 10:56:50 +02:00
Marcin Maliszkiewicz	e5fffa158f	db: replica: move tables_metadata locking to commit This keeps the locking scope minimal, and since unlocking is done in commit(), locking fits here as well.	2025-10-16 10:56:10 +02:00
Marcin Maliszkiewicz	92cfc3c005	storage_service: abstract schema dependencies during token metadata update The functions prepare_token_metadata_change and commit_token_metadata_change depend on the current schema through calls to the database service. However, during an atomic schema change, the current schema does not yet include the pending changes. Despite that, we want to apply token metadata changes to those pending schema elements as well. Currently, this is achieved by postponing token metadata changes until after the rest of the schema is committed, but this breaks atomicity. To allow incorporating the prepare and commit phases into schema_applier, we need to abstract the schema dependency. This will make it possible to provide, in following commits, an implementation that includes visibility into pending changes, not just the currently active schema.	2025-10-16 10:56:09 +02:00
Botond Dénes	5d70450917	replica/mutation_dump: multi_range_partition_generator: disable garbage-collection Make use of the freshly introduced facility to disable garbage-collection on a per-query basis for range scans. This is needed so partitions that only contain garbage-collectible data are not missing from the partition-list. When using SELECT * FROM MUTATION_FRAGMENTS(), the user is expecting to see all data, even that which is dead and garbage-collectible. Include a test which reproduces the issue.	2025-10-16 10:40:28 +03:00
Botond Dénes	734a9934a6	replica: add tombstone_gc_enabled parameter to mutation query methods Allow disabling tombstone gc on a per-query basis for mutation queries. This is achieved by a bool flag passed to mutation query variants like `query_mutations_on_all_shards()` and `database::mutation_query()`, which is then propagated down to compaction_mutation_state. The future user (in the next patch) is the SELECT * FROM MUTATION_FRAGMENTS() statement which wants to see dead partitions (and rows) when scanning a table. Currently, due to garbage collections, said statement can miss partitions which only contain garbage-collectible tombstones.	2025-10-16 10:38:47 +03:00
Botond Dénes	03118a27b8	mutation/mutation_compactor: remove _can_gc member It is confusing. For query compaction, it is initialized to `always_gc`; for sstable compaction, it is initialized to a lambda calling into `can_gc()`. This makes understanding the purpose of this member very confusing. The real use of this member is to bridge mutation_partition::compact_and_expire() with can_gc(). This patch ditches the member and creates the lambda near the call sites instead, just like the other params to `compact_and_expire()` already are. can_gc() now also respects _tombstone_gc.is_gc_enabled() instead of just blindly returning true when in query mode. With this patch, whether tombstones are collected or not in query mode is now consistent and controlled by the tombstone_gc_state.	2025-10-16 10:38:47 +03:00
Botond Dénes	cb27c3d6e9	tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc Currently, to disable tombstone-gc on-demand completely, one has to pass down a bool flag along with the already required tombstone_gc_state to the code which does the compacting. This is redundant and confusing, the tombstone_gc_state is supposed to encapsulate all tombstone-gc related logic in a transparent way. Add dedicated factory methods for no-gc and gc-all, to allow creating a tombstone_gc_state which transparently gcs for all or no tombstones.	2025-10-16 10:38:47 +03:00
Piotr Wieczorek	15c399ed40	test/alternator: Add more Streams tests for UpdateItem and BatchWriteItem This commit adds tests to `test_streams.py` (i.e. Alternator Streams) checking the following cases: * putting an item with BatchWriteItem shouldn't emit a log if the old item and the new item are identical, * deleting an item with BatchWriteItem shouldn't emit a log if the item doesn't exist, * UpdateItem shouldn't emit a log if the old item and the new item are identical. These cases haven't been tested until this commit. Refs https://github.com/scylladb/scylladb/issues/6918 Closes scylladb/scylladb#26396	2025-10-16 09:34:12 +03:00
Pavel Emelyanov	dbca0b8126	Update seastar submodule * seastar 270476e7...bd74b3fa (20): > memory: Decay large allocation warning threshold > iotune: fix very long warm up duration on systems with high cpu count > Add lib info to one line backtrace > io: Count and export number of AIO retries > io_queue: Destroy priority class data with scheduling group > Merge 'Expell net::packet from output_stream API stack' from Pavel Emelyanov code: Introduce new API level iostream: Remove write()-s of packet/scattered_message from new API level iostream: Convert output_stream::_zc_bufs to vector of buffers code: Add data_sink_impl::put(std::span<temporary_buffer>) method code: Prepare some data_sink_impl::do_put(temporary_buffer) methods iostream: Introduce output_stream::write(span<temporary_buffer>) overload packet: Add packet(std::span<temporary_buffer>) constructor temporary_buffer: Add detach_front() helper > cooking: update gnutls to 3.7.11 > file: Configure DMA alignment from block size > util: adapt to fmt 12.0.0 API changes > Merge 'Internalize reactor::posix_... API methods' from Pavel Emelyanov reactor: Deprecate and internalize posix_connect() reactor: Deprecate and internalize posix_listen() > cooking: update fmt to modern version > Merge 'Add prometheus bench, coroutinize prometheus' from Travis Downs prometheus: coroutinize metrics writing prometheus_test: add global label test introduce metrics_perf bench > operator co_await: use rvalue reference > futurize::invoke: use std::invoke > io_tester: Don't skip 0 position in sequential workflows > io_queue: Use own logger for messages > .clangd: tell the LSP about seastar's header style > docker: Update to plucky > Merge 'Convert timer test into seastar test (and a bit more)' from Pavel Emelyanov test: Remove OK macro test: Fix one failure check test: Use boost checkers instead of BUG() macro test: Fix indentation after previous patch test: Convert timer_test into seastar test(s) Closes scylladb/scylladb#26560	2025-10-16 07:55:17 +03:00
Nadav Har'El	921d07a26b	cql: make SELECT's "internal page size" configurable In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or secondary index, it needs to perform internal scans. It uses an "internal page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000. There was an ad-hoc and undocumented way to override this default in C++ tests, using functions in test/lib/select_statement_utils.hh, but it was so non-obvious that the test that most needed to override this default - the very slow test test_indexing_paging_and_aggregation which would have been much faster with a lower setting - never used it. So in this patch we replace the ad-hoc configuration functions by a bona-fide Scylla configuration option named "select_internal_page_size". The few C++ tests that used the old configuration functions were modified to use the new configuration parameter. The slow test test_indexing_paging_and_aggregation still doesn't use the new configuration to become faster - we'll do this in the next patch. Another benefit of having this "internal page size" as a configuration option is that one day a user might realize that the default choice 10,000 is bad for some reason (which I can't envision right now), so having it configurable might come in handy. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 18:42:09 +03:00
Patryk Jędrzejczak	71de01cd41	test: test_raft_recovery_entry_loss: fix the typo in the test case name	2025-10-15 16:58:28 +02:00
Patryk Jędrzejczak	da8748e2b1	test: verify that schema pulls are disabled in the Raft-based recovery procedure We do this at the end of `test_raft_recovery_entry_loss`. It's not worth adding a separate regression test, as tests of the recovery procedure are complicated and have a long running time. Also, we choose `test_raft_recovery_entry_loss` out of all tests of the recovery procedure because it does some schema changes.	2025-10-15 16:58:28 +02:00
Patryk Jędrzejczak	ec3a35303d	raft topology: disable schema pulls in the Raft-based recovery procedure Schema pulls should always be disabled when group 0 is used. However, `migration_manager::disable_schema_pulls()` is never called during a restart with `recovery_leader` set in the Raft-based recovery procedure, which causes schema pulls to be re-enabled on all live nodes (excluding the nodes replacing the dead nodes). Moreover, schema pulls remain enabled on each node until the node is restarted, which could be a very long time. The old gossip-based recovery procedure doesn't have this problem because we disable schema pulls after completing the upgrade-to-group0 procedure, which is a part of the old recovery procedure. Fixes #26569	2025-10-15 16:58:24 +02:00
Nadav Har'El	afc5379148	secondary index: translate test_indexing_paging_and_aggregation to Python The Boost test test_indexing_paging_and_aggregation is one of the slowest boost tests. But it's hard to understand why it needs to be so slow - the C++ test code is opaque, and uncommented. The test didn't need to be in C++ - it only uses CQL, not any internal interfaces - but it was written in 2019, a year before test/cqlpy was created. So before we can make this test faster, this patch translates it to Python and adds significant amount of comments. The new Python test is functionally identical to the old C++ test - it is not (yet) made smaller or faster. The new test takes a whopping 9 seconds to run on my laptop (in dev build mode). We'll reduce that in the next patch. As usual, the cqlpy test can also be tested on Cassandra, and unsurprisingly, it passes. Refs #16134 (which asks to translate more MV and SI tests to Python). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 17:50:37 +03:00
Piotr Dulikowski	61662bc562	Merge 'alternator: Make CDC use preimages from LWT for Alternator' from Piotr Wieczorek This patch adds a struct `per_request_options` used to communicate between CDC and upper abstraction layers. We need this for better compatibility with DynamoDB Streams in Alternator (https://github.com/scylladb/scylladb/issues/6918) to change operation types of log rows. This patch also adds a way to conditionally forward the item read by LWT to CDC and use it as a preimage. For now, only Alternator uses this feature. The main changes are: - add a struct `cdc::per_request_options` to pass information between CDC and upper abstraction layers, - add the struct to `cas_request::apply`'s signature, - add a possibility to provide a preimage fetched by an upper abstraction layer (to propagate a row read by Alternator to CDC's preimage). This reduces the number of reads-before-write by 1 for some Alternator requests and it is always safe. It's possible to use this feature also in CQL. No backport, it's a feature. Refs https://github.com/scylladb/scylladb/issues/6918 Refs https://github.com/scylladb/scylladb/pull/26121 Closes scylladb/scylladb#26149 * github.com:scylladb/scylladb: alternator, cdc: Re-use the row read by LWT as a CDC preimage cdc: Support prefetched preimages storage: Add cdc options to cas_request::apply cdc, storage: Add a struct to pass per-mutation options to CDC cdc: Move operations enum to the top of the namespace	2025-10-15 12:30:29 +02:00
Piotr Wieczorek	28eda0203e	alternator: Small cleanup, removing unnecessary statements, etc. Tiny code cleanup to improve readability without changing behavior. Changes: - remove unused variables and imports, - remove redundant whitespaces, and a duplicated `public:` access specifier, - use `is_aws` function to check if running in AWS test/alternator/test_metrics.py, - other trivial changes. Closes scylladb/scylladb#26423	2025-10-15 12:05:20 +02:00
Pavel Emelyanov	7bd50437ff	test: Remove unused operator<<(radix_tree_test::test_data) It was used while debugging the test Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26458	2025-10-15 11:57:56 +02:00
Marcin Maliszkiewicz	106fd39c6c	storage_service: split replicate_to_all_cores to steps In later commits schema merge code will use those prepare and commit steps. Rest of the code will continue using replicate_to_all_cores.	2025-10-15 10:54:24 +02:00
Gleb Natapov	eb9112a4a2	db: experimental consistent-tablets option The option will be used to hide the consistent tablets feature until it is ready.	2025-10-15 11:27:10 +03:00
Dawid Mędrek	3aa07d7dfe	test/cluster/mv: Provide reason why test is skipped We point to the issue explaining why the test was disabled and what can be done about it. Closes scylladb/scylladb#26541	2025-10-15 09:22:39 +02:00
Jenkins Promoter	d731d68e66	Update pgo profiles - aarch64	2025-10-15 05:21:46 +03:00
Jenkins Promoter	b6237d7dd4	Update pgo profiles - x86_64	2025-10-15 04:54:54 +03:00
Piotr Dulikowski	aed166814e	test: cluster: skip flaky test_raft_recovery_entry_lose test Unfortunately, the test became flaky and is blocking promotion. The cause of the flakiness is not known yet, but it is unrelated to other items currently queued on the `next` branch. The investigation continues on GitHub issue scylladb/scylladb#26534. In the meantime, skip the test to unblock other work. Refs: scylladb/scylladb#26534 Closes scylladb/scylladb#26549	2025-10-14 19:35:44 +02:00
Botond Dénes	d0844abb5c	mutation/mutation_compactor: compaction:stats: split partitions Into total and live. Currently only live (those with live content) are counted. Report live and total separately, just like we do for rows. This allows deducing the count of dead partitions as well, which is particularly interesting for scans. Closes scylladb/scylladb#26548	2025-10-14 19:08:47 +03:00
Marcin Maliszkiewicz	b0f11b6d91	db: schema_applier: unify token_metadata loading Putting it into a single place gives more clarity on how _pending_token_metadata is made and avoids extra per shard copy when tablets change.	2025-10-14 10:56:37 +02:00
Marcin Maliszkiewicz	d67632bfe2	replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge This copy is now used during the whole duration of schema merge. If it changes due to tablet_hint then it's replicated to all shards as before.	2025-10-14 10:56:36 +02:00
Marcin Maliszkiewicz	389afcdeb6	service: fix dependencies during migration_manager startup We need to avoid reloading schema early as it goes via schema_applier which internally depends on storage_service and on distributed_loader initializing all keyspaces. Simply moving migration manager startup later in the code is not easy as some services depend on it being initialized so we just enable those feature listeners a bit later.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	46bff28a38	db: schema_applier: move pending_token_metadata to locator It never belonged to tables and views and its placement stems from location of _tablet_hint handling code. In the following commits we'll reference it in storage_service.cc.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	1a539f7151	db: always use _tablet_hint as condition for tablet metadata change When all schema_applier code uses this condition it's easier to grep than when we use different, derived conditions.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	c112916215	db: refactor new_token_metadata into pending_token_metadata It prepares pending_token_metadata to handle both new and copy of existing metadata for consistent usage in later commit. It also adds shared_token_metadata getter so that we don't need to get it from db.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	668231d97c	db: rename new_token_metadata to pending_token_metadata Part of the refactor done in following commit. Separated for easier review.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	0c4c995c0d	db: schema_applier: move types storage init to merge_types func Merge_types function groups operation related to types, types storage fits this group.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	794d68e44c	db: schema_applier: make merge functions non-static members This is mechanical change which simplifies the code. Schema_applier class is an object which holds schema merging intermediate state so it's fine that all schema merging functions have access to this state.	2025-10-14 10:56:25 +02:00
Marcin Maliszkiewicz	209563f478	db: remove unused proxy from create_keyspace_metadata	2025-10-14 10:56:25 +02:00
Ernest Zaslavsky	413739824f	s3_client: track memory starvation in background filling fiber Introduce a counter metric to monitor instances where the background filling fiber is blocked due to insufficient memory in the S3 client. Closes scylladb/scylladb#26466	2025-10-14 11:22:54 +03:00
Piotr Wieczorek	5ff2d2d6ab	alternator, cdc: Re-use the row read by LWT as a CDC preimage Propagates the row read by CAS to CDC's preimage to save one read-before-write. As of now, a preimage in Alternator Streams always contains the entire item (see previous_item_read_command in executor.cc), so the resulting preimage should stay the same. In other words, this change should be transparent to users.	2025-10-14 07:52:40 +02:00
Piotr Wieczorek	d4581cc442	cdc: Support prefetched preimages This commit adds support to pass a preimage selected by an upper layer to CDC. The responsibility for the correctness of the preimage (i.e. the selected columns, whether it's up to date, etc.) lies with the caller. It may be improved in the future by validating the preimage, e.g. by "slicing" the received preimage to the necessary columns. The motivation behind this change was to reduce the number of read-before-writes and avoid reading the row twice for Alternator Streams in an increased compatibility mode with DynamoDB. This is to be added in a following commit. Until then, this commit should be a no-op.	2025-10-14 07:29:07 +02:00
Łukasz Paszkowski	125bf391a7	utils/directories: ignore files when retrieving stats fails During Scylla startup, directories are created and verified in `directories::do_verify_owner_and_mode()`. It is possible that while retrieving file stats, a file might be removed, leading to Scylla failing to boot. This is particularly visible in `storage/test_out_of_space.py` tests, which use FUSE to mount size-limited volumes. When a file that is open by another process is removed, FUSE renames it to `.fuse_hidden*`. In `directories::do_verify_owner_and_mode()`, the code performs a `scan_dir` to list files and retrieves their stats to verify type, mode, and ownership. If a file is removed while retrieving its stats, we see errors such as: ``` Failed to get /scylladir/testlog/x86_64/dev/volumes/e0125c60-1e63-4330-bf6f-c0ea3e466919/scylla-0/hints/1/.fuse_hidden0000001800000005 ``` This change makes `do_verify_owner_and_mode()` ignore files when retrieving stats fails, avoiding spurious errors during verification. Refs: https://github.com/scylladb/scylladb/issues/26314 Closes scylladb/scylladb#26535	2025-10-13 20:41:25 +03:00
Botond Dénes	46af0127e9	test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql Comprehensive test for the new CQL input format.	2025-10-13 18:10:40 +03:00
Botond Dénes	180bf647f7	replica/mutation_dump: add support for virtual tables Not supported currently as such tables have no memtables, cache or sstables, so any select * from mutation_fragments() query will return empty result. Detect virtual tables and return their content with a distinct 'virtual-table' mutation_source designation.	2025-10-13 18:10:40 +03:00
Botond Dénes	64c32ca501	tools/scylla-sstable: print_query_results_json(): handle empty value buffer Print null, similar to disengaged optional value.	2025-10-13 18:10:40 +03:00
Botond Dénes	e404dd7cf0	tools/scylla-sstable: add cql support to write operation Add new --input-format command line argument. Possible values are json (current) and cql (new -- added in this patch). When --input-format=cql (new default), the input-file is expected to contain CQL INSERT, UPDATE or DELETE statements, separated by semicolon. The input file can contain any number of statements, in any order. The statements will be executed and applied to a memtable, which is then flushed to create an sstable with the content generated from the statement. The memtable's size is capped at 1MiB, if it reaches this size, it is flushed and recreated. Consequently, multiple sstables can be created from a single scylla-sstable write --input-format=cql operation.	2025-10-13 18:10:40 +03:00
Dawid Mędrek	7d017748ab	db/commitlog: Extend segment truncation error messages We include more relevant information for debugging purposes: the remaining bytes and the size. It might be useful to determine where exactly an error occurred and help reason about it. Closes scylladb/scylladb#26486	2025-10-13 17:42:31 +03:00
Nadav Har'El	06108ea020	test/alternator: a small cleanup for a test in test_streams.py This patch makes three small mostly-cosmetic improvements to a test in test/alternator/test_streams.py: 1. The test is renamed "test_streams_deleteitem_old_image_no_ck" to emphasize its focus on the combination of deleteitem, old image, and no ck. The "putitem" we had in the name was not relevant, and the "old_image" was missing and important. 2. Moreover, using PutItem in this test just to set up the test scenario mixed the bug which the test tries to reproduce with a different only-recently-fixed bug (that PutItem also generated a spurious "REMOVE" event). So I changed the use of PutItem by using UpdateItem, to make this test independent of the other bug. Test independence is important because it allows us - if we want - to backport a fix for just one bug independently of the fix to the other bug. 3. Also improved the comment in front of the test to mention where we already tested the with-ck case, and also to mention issue 26382 which this test reproduces (the xfail line also mentions it, but the xfail line will be removed when the bug is fixed - but the mention in the comment will remain - and should remain). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26526	2025-10-13 17:42:31 +03:00
Piotr Dulikowski	1cf944577b	Merge 'Fix vector store client flaky test' from Karol Nowacki This series of patches improves test vector_store_client_test stability. The primary issue with flaky connections was discovered while working on PR #26308. Key Changes: - Fixes premature connection closures in the mock server: The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly. - Removes a retry workaround: With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed. - Mocks the DNS resolver in tests: The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests. - Corrects request timeout handling: A bug has been fixed where the request timeout was not being reset between consecutive requests. - Unifies test timeouts: Timeouts have been standardized across the test suite for consistency. Fixes: #26468 It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches. Closes scylladb/scylladb#26374 * github.com:scylladb/scylladb: vector_search: Unify test timeouts vector_search: Fix missing timeout reset vector_search: Refactor ANN request test vector_search: Fix flaky connection in tests vector_search: Fix flaky test by mocking DNS queries	2025-10-13 17:42:31 +03:00
Botond Dénes	f4f99ece7d	tools/scylla-sstable: write_operation(): fix indentation Left broken in previous patch for easier review.	2025-10-13 17:35:50 +03:00
Botond Dénes	023aed0813	tools/scylla-sstable: write_operation(): prepare for a new input-format In the next patches a new input-format will be introduced, which can produce multiple output format. To prepare for this, consolidate the code which produces an sstable into a reusable lambda function. Moves code around, reduces churn in next patches. Indentation is left broken for easier review.	2025-10-13 17:35:50 +03:00
Botond Dénes	f03cec9574	tools/scylla-sstable: generalize query_operation_validate_query() Make error messages more generic, so they are not specific to select. Make it a template on the type of cql statement for the final check. To avoid templating the whole thing, the function is split into two. Parametrize the name of the allowed statement types in said check. Prepares the method to be shared between query operation and write operation (future change). While at it, also change query param type to std::string_view to avoid some copies.	2025-10-13 17:35:50 +03:00
Botond Dénes	61e70d1b11	tools/scylla-sstable: move query_operation_validate_query() Move it in the source code above write_operation(), as said operation will soon want to use this method too.	2025-10-13 17:35:50 +03:00
Botond Dénes	dfe4cfc0e2	tools/scylla-sstable: extract schema transformation from query operation This transformation enables an existing schema to be created as a table in cql_test_env, to be used to read/write sstables belonging to said schema. Extract this into a method, to be shared by a future operation which will also want to do this.	2025-10-13 17:35:50 +03:00
Botond Dénes	970d4f0dcd	replica/table: add virtual write hook to the other apply() overload too Currently only one has it, which means virtual table can potentially miss some writes.	2025-10-13 17:35:50 +03:00
Calle Wilund	68c109c2df	docs::dev::object_storage: Add some initial info on GS storage Augments the object storage document with config options etc for using GS instead of S3. TODO: add proper gsutil command line examples for manual managing of GCP storage.	2025-10-13 08:53:28 +00:00
Calle Wilund	54a7d7bd47	docs/dev: Add mention of (nested) docker usage in testing.md As one (of more?) places to document the fact that we partially rely on resolving docker images for CI.	2025-10-13 08:53:28 +00:00
Calle Wilund	403247243b	sstables::object_storage_client: Forward memory limit semaphore to GS instance Enforces object storage limits to the GS implementation as well.	2025-10-13 08:53:28 +00:00
Calle Wilund	01f4dfed84	utils::gcp::object_storage: Add optional memory limits to up/download Adds optional memory semaphore to limit the mem buffer usage in sink/source. Note that we don't bookkeep exact, to avoid deadlock issues in higher layer. In upload, we overlease on first buffer put to ensure we can at least fill the desired 8M of buffers. We try to adjust when going over, but if we fail, we fail, but at least will initiate upload -> soon release memory. On next put, we try to grab multiples of 8M again, and so forth. Thus potentially causing waiting for resources, without ending up not uploading at least one active sink. For download (source), we try to get lease for as much as we want to read, but if we fail, we adjust this down to 256k and download anyway. Since this will typically be released immediately, we at least don't overrun for long, and again, avoid fully stopping, throttling rate instead.	2025-10-13 08:53:27 +00:00
Calle Wilund	5e4e5b1f4a	sstables::object_storage_client: Add multi-upload support for GS Uses file splitting + object merge to facilitate parallel, resumable upload of files with known size.	2025-10-13 08:53:27 +00:00
Calle Wilund	bd1304972c	utils::gcp::storage: Add merge objects operation Allows merging 1-32 smaller files into a destination.	2025-10-13 08:53:27 +00:00
Calle Wilund	e940a1362a	test_backup/test_basic: Make tests multiplex both s3 and gs backends Change fixture used + property/config access to allow running with arbitrary bucket-object based backend.	2025-10-13 08:53:27 +00:00
Calle Wilund	80c02603a8	test::cluster::conftest: Add support for multiple object storage backends Adds an `object_storage` fixture with parameterization to iterate through 's3' and 'gs' backends. For the former, will instantiate the `s3_server` backend (modified to better handle being actual temp, function level server). For the latter, will either give back a frontend if env vars indicating "real" GS buckets and endpoints are used, or launch a docker image for fake-gcs-server on a free port. Please read the comment in the code about the management of server output, as this is less than optimal atm, but I can't figure out the issue with it. All returned fixture objects will respond to `address`, `bucket` properties, as well as be able to create endpoint config objects for scylla.	2025-10-13 08:53:27 +00:00
Calle Wilund	da36a9d78e	boost::gcs_storage_test: reindent Remove redundant indentation/moosewings.	2025-10-13 08:53:27 +00:00
Calle Wilund	1356f60c69	boost::gcs_storage_test: Convert to use fixture Instead of test-local server/endpoint etc, use the gcs test fixture, with the added bonus of a suite-shared one for additional speed.	2025-10-13 08:53:27 +00:00
Calle Wilund	7c6b4bed97	tests::boost: Add GS object storage cases to mirror S3 ones I.e. run same remote storage backend unit tests for GS backend	2025-10-13 08:53:27 +00:00
Calle Wilund	af2616d750	tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS A test fixture object for either real google storage or fake-gcs-server using test local podman. Copied/transposed from gcp_object_storage_test.	2025-10-13 08:53:26 +00:00
Calle Wilund	a33fdd0b62	tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env Move some code to compilation unit + add some overloads. Add a RAII-object for temporary setting current process env as well.	2025-10-13 08:53:26 +00:00
Calle Wilund	527a6df460	sstables::object_storage_client: Add google storage implementation Allowing GS config to be honoured.	2025-10-13 08:53:26 +00:00
Calle Wilund	956d26aa34	test_services: Allow testing with GS object storage parameters	2025-10-13 08:53:26 +00:00
Calle Wilund	da7099a56e	utils::gcp::gcp_credentials: Add option to create uninitialized credentials To avoid having to async wait for creating credentials, allow lazy init (in actual token renew) of credentials. This is not super pleasant, since it means any error will be late, but it is required more or less for the code paths into which we intend to place this.	2025-10-13 08:53:26 +00:00
Calle Wilund	fd13ffd95d	utils::gcp::object_storage: Make create_download_source return seekable_data_source Since, given the nature of object storage API:s, it is no more complicated to provide a reasonable implementation of a seekable, limited, interface, give this back, which in turn means upper layers can provide easy read-only file interfaces. Hint hint.	2025-10-13 08:53:26 +00:00
Calle Wilund	cc899d4a86	utils::gcp::object_storage: Add defensive copies of string_view params	2025-10-13 08:53:26 +00:00
Calle Wilund	2093e7457d	utils::gcp::object_storage: Add missing retry backoff increase Ensure we increase wait time on subsequent backoffs	2025-10-13 08:53:26 +00:00
Calle Wilund	74578aaae2	utils::gcp::object_storage: Add timestamp to object listing	2025-10-13 08:53:26 +00:00
Calle Wilund	d0fe001518	utils::gcp::object_storage: Add paging support to list_objects Allowing listing N entries at a time, with continuation.	2025-10-13 08:53:26 +00:00
Calle Wilund	144b550e4f	object_storage_client: Add object_name wrapper type Remaining S3-centric, but abstracting the object name to possible implementations not quite formatted the same.	2025-10-13 08:53:25 +00:00
Calle Wilund	9dde8806dd	utils::gcp::object_storage: Add optional abort_source Add forwarded abort_source to lengthy ops	2025-10-13 08:53:25 +00:00
Calle Wilund	926177dfb4	utils::rest::client: Add abort_source support Add optional forwarded abort_source	2025-10-13 08:53:25 +00:00
Calle Wilund	5d4558df3b	sstables: Use object_storage_client for remote storage Replaces direct s3 interfaces with the abstraction layer, and opens the way for multiple implementations/backends	2025-10-13 08:53:25 +00:00
Calle Wilund	25e932230a	sstables::object_storage_client: Add abstraction layer for OS clients (s3 initial) Adds abstraction layer for creating the various ops and IO objects for storing sstable data on cloud storage	2025-10-13 08:53:25 +00:00
Calle Wilund	a2cd061a5d	s3::upload_progress: Promote to general util type	2025-10-13 08:53:25 +00:00
Calle Wilund	868d057aae	storage_options: Abstract s3 to "object_storage" and add gs as option Since both are bucket+prefix oriented, we can basically use same options for both, only distinguished by actual protocol. Abstract the types and the helper parse etc routines to handle either. Use "gs" as term for gcs (google cloud storage), since this is the URL scheme used.	2025-10-13 08:53:25 +00:00
Calle Wilund	ac438c61e6	sstables::file_io_extension: Change "creator" callback to just data_source Because the concept of pushing reading range does not work for the wrapping we do (i.e. encryption), there is no point having it here. We need to do said range handling higher up. Also, must allow multi-layered wrapping.	2025-10-13 08:53:25 +00:00
Calle Wilund	2e49270da5	utils::io-wrappers: Add ranged data_source Provides a data_source wrapper to read a specific range of a source stream.	2025-10-13 08:53:25 +00:00
Calle Wilund	91c0467282	utils::io-wrappers: Add file wrapper type for seekable_source Provides a read-only file interface for a seekable_source object.	2025-10-13 08:53:25 +00:00
Calle Wilund	e62a18304e	utils::seekable_source: Add a seekable IO source type Extension of data_source, with the ability to a.) Seek in any direction, i.e. move backwards. Thus not pure stream. b.) Read a limited number of bytes. The very transparent reason for the interface is to have a base abstraction for providing a read-only file layer for networked resources.	2025-10-13 08:53:24 +00:00
Calle Wilund	14dada350a	object_storage_endpoint_param: Add gs storage as option	2025-10-13 08:53:24 +00:00
Calle Wilund	78d9dda060	config: break out object_storage_endpoint_param preparing for multi storage Moves the config wrapper to own file (to reduce recompilation for modifying) and refactors to handle extending this parameter to non-s3 endpoint configs.	2025-10-13 08:53:24 +00:00
Avi Kivity	c783f0e539	Merge 'index: Prefer const qualifiers wherever possible' from Dawid Mędrek We add missing `const`-qualifiers wherever possible in the module. A few smaller changes were included as a bonus. Backport: not needed. This is a cleanup. Closes scylladb/scylladb#26485 * github.com:scylladb/scylladb: index/secondary_index_manager: Take std::span instead of std::vector index/secondary_index_manager: Add missing const qualifier index/vector_index: Add missing const qualifiers cql3/statements/index_prop_defs.cc: Remove unused include cql3/statements/index_prop_defs.cc: Mark function as TU-local cql3/statements/index_prop_defs: Mark methods as const-qualified	2025-10-12 19:47:53 +03:00
Michał Chojnowski	93dac3d773	sstables/compressor: relax a large allocation warning in ZSTD_CDict creation ZSTD_CDict needs a big contiguous allocation and there's no way around that. The only thing to do is relax the warning appropriately. Closes scylladb/scylladb#25393	2025-10-12 18:21:11 +03:00
Botond Dénes	24c6476f73	mutation/mutation_compactor: add tombstone_gc_state to query ctor So tombstones can be purged correctly based on the tombstone gc mode. Currently if repair-mode is used, tombstones are not purged at all, which can lead to purged tombstones being re-replicated to replicas which already purged them via read-repair. This is not a correctness problem, tombstones are not included in data query result or digest, these purgeable tombstones are only a nuisance for read repair, where they can create extra differences between replicas. Note that for the read repair to trigger, some difference other than in purgeable tombstones has to exist, because as mentioned above, these are not included in digests. Fixes: scylladb/scylladb#24332 Closes scylladb/scylladb#26351	2025-10-12 17:48:15 +03:00
Botond Dénes	d9c3772e20	service/storage_proxy: send batches with CL=EACH_QUORUM Batches that fail on the initial send are retried later, until they succeed. These retries happen with CL=ALL, regardless of what the original CL of the batch was. This is unnecessarily strict. We tried to follow Cassandra here, but Cassandra has a big caveat in their use of CL=ALL for batches. They accept saving just a hint for any/all of the endpoints, so a batch which was just logged in hints is good enough for them. We do not plan on replicating this usage of hints at this time, so as a middle ground, the CL is changed to EACH_QUORUM. Fixes: scylladb/scylladb#25432 Closes scylladb/scylladb#26304	2025-10-12 17:18:41 +03:00
Michał Chojnowski	7c6e84e2ec	test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test It turns out that Boost assertions are thread-unsafe, (and can't be used from multiple threads concurrently). This causes the test to fail with cryptic log corruptions sometimes. Fix that by switching to thread-safe checks. Fixes scylladb/scylladb#24982 Closes scylladb/scylladb#26472	2025-10-12 17:16:51 +03:00
Piotr Wieczorek	8cd9f5d271	test/alternator: Add a Streams test reproducing #26382 This commit adds a test that reproduces an issue, wherein OldImage isn't included in the REMOVE events produced by Alternator Streams. Refs https://github.com/scylladb/scylladb/issues/26382 Closes scylladb/scylladb#26383	2025-10-12 11:09:57 +03:00
Piotr Wieczorek	a55c5e9ec7	alternator: Correct RCU undercount in BatchGetItem The `describe_multi_item` function treated the last reference-captured argument as the number of used RCU half units. The caller `batch_get_item`, however, expected this parameter to hold an item size. This RCU value was then passed to `rcu_consumed_capacity_counter::get_half_units`, treating the already-calculated RCU integer as if it were a size in bytes. This caused a second conversion that undercounted the true RCU. During conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH` (=4KB), so the double conversion divided the number of bytes by 16 MB. The fix removes the second conversion in `describe_multi_item` and changes the API of `describe_multi_item`. Fixes: https://github.com/scylladb/scylladb/pull/25847 Closes scylladb/scylladb#25842	2025-10-12 10:42:32 +03:00
Karol Nowacki	62deea62a4	vector_search: Unify test timeouts The test previously used separate timeouts for requests (5s) and the overall test case (10s). This change unifies both timeouts to 10 seconds.	2025-10-10 16:49:06 +02:00
Karol Nowacki	0de1fb8706	vector_search: Fix missing timeout reset The `vector_store_client_test` could be flaky because the request timeout was not consistently reset in all code paths. This could lead to a timeout from a previous operation firing prematurely and failing the test. The fix ensures `abort_source_timeout` is reset before each request. The implementation is also simplified by changing `abort_source_timeout::reset` so that it combines the reset and arm operations into a single invocation.	2025-10-10 16:48:54 +02:00
Karol Nowacki	d99a4c3bad	vector_search: Refactor ANN request test Refactor the `vector_store_client_test_ann_request` test to use the `vs_mock_server` class, unifying the structure of the test cases. This change also removes retry logic that waited for the server to be ready. This is no longer necessary because the handler now exists for all index names and consumes the entire request payload, preventing connection closures. Previously, the server did not handle requests for unconfigured indexes, which caused the connection to close. This could lead to a race condition where the client would attempt to reuse a closed connection.	2025-10-10 16:48:20 +02:00
Karol Nowacki	2eb752e582	vector_search: Fix flaky connection in tests The vector store mock server was not reading the ANN request body, which could cause it to prematurely close the connection. This could lead to a race condition where the client attempts to reuse a closed connection from its pool, resulting in a flaky test. The fix is to always read the request body in the mock server.	2025-10-10 16:48:09 +02:00
Karol Nowacki	ac5e9c34b6	vector_search: Fix flaky test by mocking DNS queries The `vector_store_client_uri_update_to_invalid` test was flaky because it performed real DNS lookups, making it dependent on the network environment. This commit replaces the live DNS queries with a mock to make the test hermetic and prevent intermittent failures. `vector_search_metrics_test` test did not call configure{vs}, as a consequence the test did real DNS queries, which made the test flaky. The refreshes counter increment has been moved before the call to the resolver. In tests, the resolver is mocked leading to lack of increments in production code. Without this change, there is no way to test DNS counter increments. The change also simplifies the test making it more readable.	2025-10-10 16:47:03 +02:00
Patryk Jędrzejczak	5f68b9dc6b	test: test_raft_no_quorum: test_can_restart: deflake the read barrier call Expecting the group 0 read barrier to succeed with a timeout of 1s, just after restarting 3 out of 5 voters, turned out to be flaky. In some unlikely scenarios, such as multiple vote splits, the Raft leader election could finish after the read barrier times out. To deflake the test, we increase the timeout of Raft operations back to 300s for read barriers we expect to succeed. Fixes #26457 Closes scylladb/scylladb#26489	2025-10-10 15:22:39 +03:00
Asias He	13dd88b010	repair: Rename incremental mode name Using the name regular as the incremental mode could be confusing, since regular might be interpreted as the non-incremental repair. It is better to use incremental directly. Before: - regular (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) After: - incremental (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) Fixes #26503 Closes scylladb/scylladb#26504	2025-10-10 15:21:54 +03:00
Michał Chojnowski	85fd4d23fa	test_sstable_compression_dictionaries_basic: reconnect robustly after node reboots Using `driver_connect()` after a cluster restart isn't enough to ensure full CQL availability, but the test assumes that it is. Fix that by making the test wait for CQL availability via `get_ready_cql()`. Also, replace some manual usages of wait_for_cql_and_get_hosts with `get_ready_cql()` too. Fixes scylladb/scylladb#25362 Closes scylladb/scylladb#25366	2025-10-10 14:27:02 +03:00
Piotr Dulikowski	0b800aab17	Merge 'db/view/view_building_worker: move `discover_existing_staging_sstables()` to the foreground' from Michał Jadwiszczak db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground This patch moves `discover_existing_staging_sstables()` to be executed from main level, instead of running it on the background fiber. This method needs to be run only once during the startup to collect existing staging sstables, so there is no need to do it in the background. This change will increase debuggability of any further issues related to it (like https://github.com/scylladb/scylladb/issues/26403). Fixes https://github.com/scylladb/scylladb/issues/26417 The patch should be backported to 2025.4 Closes scylladb/scylladb#26446 * github.com:scylladb/scylladb: db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground db/view/view_building_worker: futurize and rename `start_background_fibers()`	2025-10-09 18:24:50 +02:00
Michał Jadwiszczak	8d0d53016c	db/view/view_building_worker: update state again if some batch was finished during the update There was a race between loop in `view_building_worker::run_view_building_state_observer()` and a moment when a batch was finishing its work (`.finally()` callback in `view_building_worker::batch::start()`). State observer waits on `_vb_state_machine.event` CV and when it's awoken, it takes group0 read apply mutex and updates its state. While updating the state, the observer looks at `batch::state` field and reacts to it accordingly. On the other hand, when a batch finishes its work, it sets `state` field to `batch_state::finished` and does a broadcast on `_vb_state_machine.event` CV. So if the batch will execute the callback in `.finally()` while the observer is updating its state, the observer may miss the event on the CV and it will never notice that the batch was finished. This patch fixes this by adding a `some_batch_finished` flag. Even if the worker won't see an event on the CV, it will notice that the flag was set and it will do next iteration. Fixes scylladb/scylladb#26204 Closes scylladb/scylladb#26289	2025-10-09 18:17:22 +02:00
Avi Kivity	55d4d39ae3	Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski This is a cherry-pick of https://github.com/scylladb/scylladb/pull/25412 commits, as the changes were reverted in 364316dd2f2212bbbb446eaa2a4b0bd53d125ad5 due to https://github.com/scylladb/scylladb/issues/26163. The underlying problem (https://github.com/scylladb/scylladb/issues/26190) was fixed in seastar (https://github.com/scylladb/seastar/pull/2994), so https://github.com/scylladb/scylladb/pull/25412 commits are restored without changes (only rebase conflicts were resolved). === This patch series: - Increases the number of allowed scheduling groups to allow creation of `sl:driver` - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader` - Modifies `topology_coordinator` to create `sl:driver` after upgrades. - Implements using `sl:driver` for new connections in `transport/server` - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`. - Adds tests to verify the new functionality - Modifies existing tests to let them pass after `sl:driver` is added - Modifies the documentation to contain new `sl:driver` The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)): - Start ScyllaDB with one node - Create 1000 keyspaces, 1 table in each keyspace - Start `cassandra-stress` (`-rate threads=50 -mode native cql3`) - Run connection storm with 1000 sessions (100 python processes, 10 sessions each) The maximum latency during connection storm dropped from 224.94ms to 41.43ms (those numbers are averages from 20 test executions, where max latency was in [140ms, 361ms] before the change and [31.4ms, 61.5ms] after).
The snippet of cassandra-stress output from the moment of connection storm: Before: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... total, 789206, 85887, 85887, 85887, 0.6, 0.3, 2.0, 2.0, 2.5, 5.0, 9.0, 0.09679, 0, 0, 0, 0, 0, 0 total, 909322, 120116, 120116, 120116, 0.4, 0.2, 1.9, 2.0, 2.1, 3.1, 10.0, 0.09053, 0, 0, 0, 0, 0, 0 total, 964392, 55070, 55070, 55070, 0.9, 0.4, 2.0, 4.5, 7.7, 18.9, 11.0, 0.09203, 0, 0, 0, 0, 0, 0 total, 975705, 11313, 11313, 11313, 4.4, 3.5, 6.5, 24.5, 82.7, 83.0, 12.0, 0.11713, 0, 0, 0, 0, 0, 0 total, 987548, 11843, 11843, 11843, 4.2, 3.5, 6.5, 33.7, 48.6, 51.5, 13.0, 0.13366, 0, 0, 0, 0, 0, 0 total, 995422, 7874, 7874, 7874, 6.3, 4.0, 7.7, 85.6, 112.9, 113.5, 14.0, 0.14753, 0, 0, 0, 0, 0, 0 total, 1007228, 11806, 11806, 11806, 4.3, 3.5, 6.5, 29.1, 43.8, 87.1, 15.0, 0.15598, 0, 0, 0, 0, 0, 0 total, 1012840, 5612, 5612, 5612, 8.2, 5.0, 11.5, 121.8, 166.6, 170.1, 16.0, 0.16535, 0, 0, 0, 0, 0, 0 total, 1016186, 3346, 3346, 3346, 13.4, 7.4, 20.1, 204.9, 207.6, 210.4, 17.0, 0.17405, 0, 0, 0, 0, 0, 0 total, 1025462, 9276, 9276, 9276, 6.3, 3.9, 9.6, 74.6, 206.8, 210.0, 18.0, 0.17800, 0, 0, 0, 0, 0, 0 total, 1035979, 10517, 10517, 10517, 4.8, 3.5, 6.7, 38.5, 82.6, 83.0, 19.0, 0.18120, 0, 0, 0, 0, 0, 0 total, 1047488, 11509, 11509, 11509, 4.3, 3.5, 6.0, 32.6, 72.3, 74.0, 20.0, 0.18334, 0, 0, 0, 0, 0, 0 total, 1077456, 29968, 29968, 29968, 1.7, 1.6, 2.9, 3.6, 7.0, 8.2, 21.0, 0.17943, 0, 0, 0, 0, 0, 0 total, 1105490, 28034, 28034, 28034, 1.8, 1.8, 3.5, 4.6, 5.3, 13.8, 22.0, 0.17609, 0, 0, 0, 0, 0, 0 total, 1132221, 26731, 26731, 26731, 1.9, 1.8, 3.8, 5.2, 8.4, 11.1, 23.0, 0.17314, 0, 0, 0, 0, 0, 0 total, 1162149, 29928, 29928, 29928, 1.7, 1.7, 3.0, 4.5, 8.0, 9.1, 24.0, 0.16950, 0, 0, 0, 0, 0, 0 ... ``` After: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 822863, 94379, 94379, 94379, 0.5, 0.3, 2.0, 2.0, 2.1, 3.7, 9.0, 0.06669, 0, 0, 0, 0, 0, 0 total, 937337, 114474, 114474, 114474, 0.4, 0.2, 2.0, 2.0, 2.1, 3.4, 10.0, 0.06301, 0, 0, 0, 0, 0, 0 total, 986630, 49293, 49293, 49293, 1.0, 1.0, 2.0, 2.1, 17.9, 19.0, 11.0, 0.07318, 0, 0, 0, 0, 0, 0 total, 1026734, 40104, 40104, 40104, 1.2, 1.0, 2.0, 2.2, 6.3, 7.1, 12.0, 0.08410, 0, 0, 0, 0, 0, 0 total, 1066124, 39390, 39390, 39390, 1.3, 1.0, 2.0, 2.2, 2.6, 3.4, 13.0, 0.09108, 0, 0, 0, 0, 0, 0 total, 1103082, 36958, 36958, 36958, 1.3, 1.1, 2.1, 2.5, 3.1, 4.2, 14.0, 0.09643, 0, 0, 0, 0, 0, 0 total, 1141987, 38905, 38905, 38905, 1.3, 1.0, 2.0, 2.4, 11.4, 12.7, 15.0, 0.09894, 0, 0, 0, 0, 0, 0 total, 1180023, 38036, 38036, 38036, 1.3, 1.0, 2.0, 3.7, 5.6, 7.1, 16.0, 0.10070, 0, 0, 0, 0, 0, 0 total, 1216481, 36458, 36458, 36458, 1.4, 1.0, 2.1, 3.6, 4.7, 5.0, 17.0, 0.10210, 0, 0, 0, 0, 0, 0 total, 1256819, 40338, 40338, 40338, 1.2, 1.0, 2.0, 2.2, 3.5, 5.4, 18.0, 0.10173, 0, 0, 0, 0, 0, 0 total, 1295122, 38303, 38303, 38303, 1.3, 1.0, 2.0, 2.4, 21.0, 21.1, 19.0, 0.10136, 0, 0, 0, 0, 0, 0 total, 1334743, 39621, 39621, 39621, 1.3, 1.0, 2.0, 2.3, 3.3, 4.0, 20.0, 0.10055, 0, 0, 0, 0, 0, 0 total, 1375579, 40836, 40836, 40836, 1.2, 1.0, 2.0, 2.1, 3.4, 5.7, 21.0, 0.09927, 0, 0, 0, 0, 0, 0 total, 1415576, 39997, 39997, 39997, 1.2, 1.0, 2.0, 2.3, 3.2, 4.1, 22.0, 0.09807, 0, 0, 0, 0, 0, 0 total, 1449268, 33692, 33692, 33692, 1.5, 1.4, 2.5, 3.2, 4.2, 5.6, 23.0, 0.09800, 0, 0, 0, 0, 0, 0 total, 1471873, 22605, 22605, 22605, 2.2, 2.0, 4.8, 5.9, 7.0, 7.9, 24.0, 0.10015, 0, 0, 0, 0, 0, 0 ... ``` Fixes: https://github.com/scylladb/scylladb/issues/24411 This is a new feature, so no backport needed. 
Closes scylladb/scylladb#26411 * github.com:scylladb/scylladb: docs: workload-prioritization: add driver service level test: add test to verify use of `sl:driver` transport: use `sl:driver` to handle driver's control connections transport: whitespace only change in update_scheduling_group transport: call update_scheduling_group for non-auth connections generic_server: transport: start using `sl:driver` for new connections test: add test_desc_* for driver service level test: service_levels: add tests for sl:driver creation and removal test: add reload_raft_topology_state() to ScyllaRESTAPIClient service_level_controller: automatically create `sl:driver` service_level_controller: methods to create driver service level service_level_controller: handle special sl:driver in DESC output topology_coordinator: add service_level_controller reference system_keyspace: add service_level_driver_created test: add MAX_USER_SERVICE_LEVELS	2025-10-09 17:28:39 +03:00
Dawid Mędrek	ecc955fbe0	index/secondary_index_manager: Take std::span instead of std::vector	2025-10-09 16:17:07 +02:00
Dawid Mędrek	074f0f2e4c	index/secondary_index_manager: Add missing const qualifier	2025-10-09 16:06:50 +02:00
Dawid Mędrek	7baf95bc4b	index/vector_index: Add missing const qualifiers	2025-10-09 16:06:24 +02:00
Dawid Mędrek	4486ac0891	cql3/statements/index_prop_defs.cc: Remove unused include	2025-10-09 16:01:56 +02:00
Dawid Mędrek	d50c2f7c74	cql3/statements/index_prop_defs.cc: Mark function as TU-local	2025-10-09 16:00:44 +02:00
Dawid Mędrek	89b3d0c582	cql3/statements/index_prop_defs: Mark methods as const-qualified	2025-10-09 15:53:29 +02:00
Avi Kivity	bb02295695	setup: add the lazytime XFS mount option In `f828fe0d59` ("setup: add the lazytime XFS version") we added the lazytime mount option to /var/lib/scylla, but it was quickly reverted (`8f5e80e61a`) as it caused a regression on CentOS 7. We reinstate it now with a kernel version check. This will avoid the lazytime mount option on CentOS 7, which is unsupported anyway. The lazytime option avoids marking the inode as dirty if it's only for the purpose of updating mtime/ctime. This won't help much while writing sstables (since the write also updates extent information), but may help a little with commitlog writes, since those are pure overwrites. It likely won't help with the RWF_NOWAIT violations seen in [1], since those are likely due to in-memory locking, not flushing dirty inodes to disk. Tested with an install to Ubuntu 24.04 LTS followed by a scylla_setup run. The lazytime option was added to the .mount file and showed up in the live mount. [1] https://github.com/scylladb/seastar/issues/2974 Closes scylladb/scylladb#26436 Fixes #26002	2025-10-09 15:55:58 +03:00
Ernest Zaslavsky	c2bab430d7	s3_client: fix `when` condition to prevent infinite locking Refine condition variable predicate in filling fiber to avoid indefinite waiting when `close` is invoked. Closes scylladb/scylladb#26449	2025-10-09 15:55:37 +03:00
Piotr Wieczorek	b54ad9e22f	storage: Add cdc options to cas_request::apply	2025-10-09 12:28:10 +02:00
Piotr Wieczorek	2c1e699864	cdc, storage: Add a struct to pass per-mutation options to CDC This will allow us to communicate with CDC from higher layers. We plan to use it to reduce the number of read-before-writes with preimages by passing the row selected in upper layers.	2025-10-09 12:28:10 +02:00
Piotr Wieczorek	66935bedac	cdc: Move operations enum to the top of the namespace	2025-10-09 12:28:10 +02:00
Michał Chojnowski	c35b82b860	test/cluster/test_bti_index.py: avoid a race with CQL tracing The test uses CQL tracing to check which files were read by a query. This is flaky if the coordinator and the replica are different shards, because the Python driver only waits for the coordinator, and not for replicas, to finish writing their traces. (So it might happen that the Python driver returns a result with only coordinator events and no replica events). Let's just dodge the issue by using --smp=1. Fixes scylladb/scylladb#26432 Closes scylladb/scylladb#26434	2025-10-09 13:22:06 +03:00
Michał Chojnowski	87e3027c81	docs: fix a parameter name in API calls in sstable-dictionary-compression.rst The correct argument name is `cf`, not `table`. Fixes scylladb/scylladb#25275 Closes scylladb/scylladb#26447	2025-10-09 13:18:47 +03:00
Robert Bindar	2c74a6981b	Make scylla_io_setup detect request size for best write IOPS We noticed during work on scylladb/seastar#2802 that on i7i family (later proved that it's valid for i4i family as well), the disks are reporting the physical sector sizes incorrectly as 512bytes, whilst we proved we can render much better write IOPS with 4096bytes. This is not the case on AWS i3en family where the reported 512bytes physical sector size is also the size we can achieve the best write IOPS. This patch works around this issue by changing `scylla_io_setup` to parse the instance type out of `/sys/devices/virtual/dmi/id/product_name` and run iotune with the correct request size based on the instance type. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#25315	2025-10-08 14:30:52 +03:00
Piotr Dulikowski	fe7ffc5e5d	Merge 'service/qos: set long timeout for auth queries on SL cache update' from Michael Litvak pass an appropriate query state for auth queries called from service level cache reload. we use the function qos_query_state to select a query_state based on caller context - for internal queries, we set a very long timeout. the service level cache reload is called from group0 reload. we want it to have a long timeout instead of the default 5 seconds for auth queries, because we don't have strict latency requirement on the one hand, and on the other hand a timeout exception is undesired in the group0 reload logic and can break group0 on the node. Fixes https://github.com/scylladb/scylladb/issues/25290 backport possible to improve stability Closes scylladb/scylladb#26180 * github.com:scylladb/scylladb: service/qos: set long timeout for auth queries on SL cache update auth: add query_state parameter to query functions auth: refactor query_all_directly_granted	2025-10-08 12:37:01 +02:00
Michał Jadwiszczak	84e4e34d81	db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground This patch moves `discover_existing_staging_sstables()` to be executed from main level, instead of running it on the background fiber. This method needs to be run only once during the startup to collect existing staging sstables, so there is no need to do it in the background. This change will increase debuggability of any further issues related to it (like scylladb/scylladb#26403). Fixes scylladb/scylladb#26417	2025-10-08 11:16:07 +02:00
Michał Jadwiszczak	575dce765e	db/view/view_building_worker: futurize and rename `start_background_fibers()` Next commit will move `discover_existing_staging_sstables()` to the foreground, so to prepare for this we need to futurize `start_background_fibers()` method and change its name to better reflect its purpose.	2025-10-08 10:19:41 +02:00
Andrzej Jackowski	0072b75541	docs: workload-prioritization: add driver service level Refs: scylladb/scylladb#24411	2025-10-08 08:25:38 +02:00
Andrzej Jackowski	f720ce0492	test: add test to verify use of `sl:driver` `sl:driver` is expected to be used for new and control connections, but other connections that run user load should not use it after the user is authenticated. Refs: scylladb/scylladb#24411	2025-10-08 08:25:33 +02:00
Andrzej Jackowski	f99b8c4a55	transport: use `sl:driver` to handle driver's control connections Before `sl:driver` was introduced, service levels were assigned as follows: 1. New connections were processed in `main`. 2. After user authentication was completed, the connection's SL was changed to the user's SL (or `sl:default` if the user had no SL). This commit introduces `service_level_state` to `client_state` and implements the following logic in `transport/server`: 1. If `sl:driver` is not present in the system (for example, it was removed), service levels behave as described above. 2. If `sl:driver` is present, the flow is: I. New connections use `sl:driver`. II. After user authentication is completed, the connection's SL is changed to the user's SL (or `sl:default`). III. If a REGISTER (to events) request is handled, the client is processing the control connection. We mark the client_state to permanently use `sl:driver`. The aforementioned state `2.III` is represented by `_control_connection` flag in `client_state`. Fixes: scylladb/scylladb#24411	2025-10-08 08:25:28 +02:00
Andrzej Jackowski	fd36bc418a	transport: whitespace only change in update_scheduling_group The indentation is changed because it will be required in the next commit of this patch series.	2025-10-08 08:25:22 +02:00
Andrzej Jackowski	278019c328	transport: call update_scheduling_group for non-auth connections Before this change, unauthorized connections stayed in the `main` scheduling group. This is not ideal: in such a case `sl:default` should be used instead, to be consistent with the scenario where a user is authenticated but has no service level assigned. This commit adds a call to `update_scheduling_group` at the end of connection creation for an unauthenticated user, to make sure the service level is switched to `sl:default`. Fixes: scylladb/scylladb#26040	2025-10-08 08:25:17 +02:00
Andrzej Jackowski	14081d0727	generic_server: transport: start using `sl:driver` for new connections Before this change, new connections were handled in a default scheduling group (`main`), because before the user is authenticated we do not know which service level should be used. With the new `sl:driver` service level, creation of new connections can be moved to `sl:driver`. We switch the service level as early as possible, in `do_accepts`. There is a possibility, that `sl:driver` will not exist yet, for instance, in specific upgrade cases, or if it was removed. Therefore, we also switch to `sl:driver` after a connection is accepted. Refs: scylladb/scylladb#24411	2025-10-08 08:25:12 +02:00
Andrzej Jackowski	b62135f767	test: add test_desc_* for driver service level Driver service level is a special service level that is created automatically by the system. Therefore, it requires special handling in DESC SCHEMA WITH INTERNALS, and these tests verify the special behavior. Refs: scylladb/scylladb#24411	2025-10-08 08:25:07 +02:00
Andrzej Jackowski	0ddf46c7b4	test: service_levels: add tests for sl:driver creation and removal Refs: scylladb/scylladb#24411	2025-10-08 08:25:02 +02:00
Andrzej Jackowski	9e9bca9bdb	test: add reload_raft_topology_state() to ScyllaRESTAPIClient To encapsulate `/storage_service/raft_topology/reload` API call	2025-10-08 08:24:57 +02:00
Andrzej Jackowski	c59a7db1c9	service_level_controller: automatically create `sl:driver` This commit: - Increases the number of allowed scheduling groups to allow the creation of `sl:driver`. - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating `sl:driver` until all nodes have increased the number of scheduling groups. - Starts using `get_create_driver_service_level_mutations` to unconditionally create `sl:driver` on `raft_initialize_discovery_leader`. The purpose of this code path is ensuring existence of `sl:driver` in new system and tests. - Starts using `migrate_to_driver_service_level` to create `sl:driver` if it is not already present. The creation of `sl:driver` is managed by `topology_coordinator`, similar to other system keyspace updates, such as the `view_builder` migration. The purpose of this code path is handling upgrades. - Modifies related tests to pass after `sl:driver` is added. Later in this patch series, `sl:driver` will be used by `transport/server` to handle selected traffic, such as the driver's schema and topology fetches. Refs: scylladb/scylladb#24411	2025-10-08 08:24:43 +02:00
Andrzej Jackowski	923559f46a	service_level_controller: methods to create driver service level This commit implements `get_create_driver_service_level_mutations` and `migrate_to_driver_service_level` in service_level_controller. Both methods create `sl:driver` with shares=200 and store this fact in `system.scylla_local`. Both methods will be used later in this patch series for automatic creation of sl:driver. Refs: scylladb/scylladb#24411	2025-10-08 08:24:38 +02:00
Andrzej Jackowski	2d296a2f9b	service_level_controller: handle special sl:driver in DESC output Later in this patch series, `sl:driver` will be added as a special service level created automatically by the system. It needs special handling in `DESC SCHEMA ...` to ensure that during backup restore: 1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists 2. If `sl:driver` exists, its configuration is fully restored (emit ALTER SERVICE LEVEL). 3. If `sl:driver` was removed, the information is retained (emit DROP SERVICE LEVEL instead of CREATE/ALTER). Refs: scylladb/scylladb#24411	2025-10-08 08:24:33 +02:00
Andrzej Jackowski	1ff605005e	topology_coordinator: add service_level_controller reference This adds a reference to sl_controller so that, later in this patch series, topology_coordinator can manage creating `sl:driver` once group0 is fully operational. Refs: scylladb/scylladb#24411	2025-10-08 08:24:28 +02:00
Andrzej Jackowski	8953f96609	system_keyspace: add service_level_driver_created This commit extends the system.scylla_local table with an additional key/value pair that can be used later in this patch series to record that `sl:driver` was already created. The purpose of storing this information is to ensure that `sl:driver` is not recreated after being intentionally removed. A new mutation is included in `register_raft_pull_snapshot` to keep `service_level_driver_created` in the state machine snapshot, which is required for proper propagation of the value when a new node is added to the cluster. Refs: scylladb/scylladb#24411	2025-10-08 08:24:23 +02:00
Andrzej Jackowski	7d2db37831	test: add MAX_USER_SERVICE_LEVELS Previously, tests used the hardcoded value 7 for the maximum number of user service levels. This commit introduces a named variable that can be shared across tests to avoid cases where this magic number goes out of sync.	2025-10-08 08:24:17 +02:00
Patryk Jędrzejczak	d391b9d7d9	test: assert that majority is lost in some tests of the recovery procedure The voter handler caused `test_raft_recovery_user_data` to stop losing group 0 majority when expected. We make sure this won't happen again in this commit. We don't change `test_raft_recovery_entry_lose` because it has some checks that would fail with group 0 majority (schema versions would match). Note that it's possible to timeout the read barrier quickly without the `timeout` parameter. See e.g. `test_cannot_add_new_node` in `test_raft_no_quorum.py`. We don't take this approach here because we don't want to change the default Raft parameters in the recovery procedure tests.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	d623844c1c	test: rest_client: add timeout support for read_barrier Scylla already handles the `timeout` parameter, so the change is simple. We use the `timeout` parameter in the following commit.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	91c8466f47	test: test_raft_recovery_user_data: lose majority when killing one dc After introducing the voter handler, the test stopped losing group 0 majority when expected because the killed dc contained 2 out of 5 voters. We fix it in this commit. The fix relies on the voter handler not doing unnecessary work. The first dc should keep its voters and majority. The test was functional even though majority wasn't lost when expected. Stopping the recovery leader before restarting it with `recovery_leader` caused majority loss in the old group 0. Hence, there is no need to backport this commit.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	c8a5e7a74e	test: test_raft_recovery_user_data: shutdown driver sessions Shutting down `ccluster_all_nodes` in the previous commit is necessary to avoid flakiness. It turns out that leaked driver sessions can impact another run of the test case (with different parameterization). Here, without shutting down `ccluster_all_nodes`, we could observe the DDL requests from `start_writes` fail in the second test case run (where `remove_dead_nodes_with == "replace"`) like this: ``` > await cql.run_async(f"USE {ks_name}") E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.46.35.70:9042 dc1>: ConnectionException('Host has been marked down or removed'), <Host: 127.46.35.71:9042 dc1>: ConnectionException('Host has been marked down or removed'), <Host: 127.46.35.3:9042 dc1>: ConnectionException('Host has been marked down or removed'), <Host: 127.46.35.25:9042>: ConnectionException('Host has been marked down or removed')}) ``` We could also see errors like this on the driver: ``` cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace 'test_1759763911381_oktks' does not exist" ``` It turned out that `test_1759763911381_oktks` was created in the first test case run (where `remove_dead_nodes_with == "remove"), and somehow the driver session created in the second test case run was still using this keyspace in some way. The DDL requests were failing on the Scylla side with the error above, and after some retries, the driver marked nodes as down. I didn't try to investigate what exactly the driver was doing. In this commit, we shut down other driver sessions used in this test. They didn't cause problems so far, but we'd better use the Python driver correctly and be safe.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	a35740cbe8	test: test_raft_recovery_user_data: use a separate driver connection for the write workload It's simpler than pausing the workload for the `cql` reconnection. Moreover, the removed `start_writes` call required group 0 majority for (redundant) CREATE KEYSPACE IF NOT EXISTS and CREATE TABLE IF NOT EXISTS statements. The test shouldn't have group 0 majority at that point, which is fixed in one of the following commits. Using a separate driver connection also allows us to call `finish_writes()` a bit later, after the `cql` reconnection.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	d1a944251e	test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node We have the global request queue now, so we can't hit "Another global topology request is ongoing, please retry." anymore.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	9a98febac5	test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s It looks like decreasing `failure_detector_timeout_in_ms` doesn't make the shutdown faster anymore. We had some changes related to requests during shutdown like #24499 and #24714. They are probably the reason.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	2b1d7f0e83	test: test_raft_recovery_user_data: speed up replace operations	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	dbd998bc15	test: stop/start servers concurrently in the recovery procedure tests This change makes these tests a bit faster.	2025-10-07 17:48:51 +02:00
Dawid Mędrek	a9577e4d52	replica/database: Fix description of `validate_tablet_views_indexes` The current description is not accurate: the function doesn't throw an exception if there's an invalid materialized view. Instead, it simply logs the keyspaces that violate the requirement. Furthermore, the experimental feature `views-with-tablets` is no longer necessary for considering a materialized view as valid. It was dropped in scylladb/scylladb@b409e85c20. The replacement for it is the cluster feature `VIEWS_WITH_TABLETS`. Fixes scylladb/scylladb#26420 Closes scylladb/scylladb#26421	2025-10-07 17:39:43 +02:00
Artsiom Mishuta	99455833bd	test.py: reintroducing sudo in resource_gather.py conditionally reintroducing sudo for resource gathering when running under docker related: https://github.com/scylladb/scylladb/pull/26294#issuecomment-3346968097 fixes: https://github.com/scylladb/scylladb/issues/26312 Closes scylladb/scylladb#26401	2025-10-07 14:42:15 +02:00
Piotr Dulikowski	264cf12b66	Merge 'view building coordinator - add missing tests' from Michał Jadwiszczak This patch adds tests for: - tablet migration during view building - tablet merge during view building. Those tests were missing from the original testing plan. We want to backport it to 2025.4 to ensure the release is bug-free. Closes scylladb/scylladb#26414 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add test for tablet merge test/cluster/test_view_building_coordinator: add test for tablet migration	2025-10-07 14:25:04 +02:00
Botond Dénes	8b0bfb817e	Merge 'Switch REST API server to use content-streaming' from Pavel Emelyanov Seastar httpd has recommended that users stop using the contiguous request.content string and read the body they need from the request's input_stream instead. However, the "official" deprecation of request content was made only recently. This PR patches the REST API server to turn this feature on and patches a few handlers that work with request bodies to read them from the request stream. Using newer seastar API, no need to backport Closes scylladb/scylladb#26418 * github.com:scylladb/scylladb: api: Switch to request content streaming api: Fix indentation after previous patch api: Coroutinize set_relabel_config handler api: Coroutinize set_error_injection handler	2025-10-07 14:13:47 +03:00
Michał Chojnowski	3cf51cb9e8	sstables: fix some typos in comments I added those typos recently, and spellcheckers complain. Closes scylladb/scylladb#26376	2025-10-07 13:20:06 +03:00
Botond Dénes	8beea931be	Merge 'Remove system_keyspace from column_family API' from Pavel Emelyanov This dependency reference is carried into column_family handlers block to make get_built_views handler work. However, the handler in question should live in view_builder block, because it works with v.b. data. This PR moves the handler there, while at it, coroutinizes it, and removes the no longer needed sys.ks. reference from column_family. API dependencies cleanup work, no need to backport Closes scylladb/scylladb#26381 * github.com:scylladb/scylladb: api: Fix indentation after previous patch api: Coroutinize get_built_indexes handler code api: Remove system_keyspace ref from column_family API block api: Move get_built_indexes from column_family to view_builder	2025-10-07 13:07:46 +03:00
Pavel Emelyanov	ed1c049c3b	scripts: Add usage to pull_github_pr script If mis-used, the script says error: unrecognized option: ..., see ./scripts/pull_github_pr.sh -h for usage but if using the suggested -h option it prints just the same. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26378	2025-10-07 10:28:35 +03:00
Lakshmi Narayanan Sreethar	b8042b66e3	cmake: replace -fvisibility=hidden compiler flag with -fvisibility-inlines-hidden The PR #26154 dropped the `-fvisibility=hidden` compiler flag and replaced it with `-fvisibility-inlines-hidden` as the former caused issues in how the `noncopyable_function::operator bool` method executed leading to incorrect return values. Apply the same fix to cmake. Fixes #26391 Closes scylladb/scylladb#26431	2025-10-07 10:10:47 +03:00
Pavel Emelyanov	127afd4da1	api: Switch to request content streaming There are three handlers that need to be patched all at once, with the server itself being marked with set_content_streaming. For the two simple handlers, just get the content string with the read_entire_stream_contiguous helper. This is what the httpd server did anyway. The "start_restore" handler used the contiguous contents to parse JSON using the rjson utility. This handler is patched to use read_entire_stream(), which returns a vector of temporary buffers. The rjson parser has a helper to parse from that vector, so the change is also an optimization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:43:26 +03:00
Pavel Emelyanov	2cfccdac5c	api: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:43:12 +03:00
Pavel Emelyanov	5668058cb0	api: Coroutinize set_relabel_config handler Without the invoke_on_all lambda, for simplicity Also keep indentation "broken" for the ease of review Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:42:30 +03:00
Pavel Emelyanov	5017a25c00	api: Coroutinize set_error_injection handler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:42:14 +03:00
Patryk Jędrzejczak	67d48a459f	raft topology: make the voter handler consider only group 0 members In the Raft-based recovery procedure, we create a new group 0 and add live nodes to it one by one. This means that for some time there are nodes which belong to the topology, but not to the new group 0. The voter handler running on the recovery leader incorrectly considers these nodes while choosing voters. The consequences: - misleading logs, for example, "making servers {<ID of a non-member>} voters", where the non-member won't become a voter anyway, - increased chance of majority loss during the recovery procedure, for example, all 3 nodes that first joined the new group 0 are in the same dc and rack, but only one of them becomes a voter because the voter handler tries to make non-members in other dcs/racks voters. Fixes #26321 Closes scylladb/scylladb#26327	2025-10-06 16:27:47 +03:00
Pavel Emelyanov	8002ddf946	code: Use tls_options::bye_timeout instead of deprecated switch Some code wants its TLS sockets to close immediately without sending BYE message and waiting for the response. Recent seastar update changed the way this functionality is requested (scylladb/seastar#2986) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26253	2025-10-06 16:25:35 +03:00
Michał Jadwiszczak	279a8cbba3	test/cluster/test_view_building_coordinator: add test for tablet merge The test pauses processing of the view building task and triggers tablet merge.	2025-10-06 15:06:11 +02:00
Michał Jadwiszczak	fc7e5370a1	test/cluster/test_view_building_coordinator: add test for tablet migration The test pauses processing of the view building task and migrates it to another node.	2025-10-06 15:02:42 +02:00
Michał Chojnowski	3b338e36c2	utils/config_file: fix a missing `allowed_values` propagation in one of `named_value` constructors In one of the constructors of `named_value`, the `allowed_values` argument isn't used. (This means that if some config entry uses this constructor, the values aren't validated on the config layer, and might give some lower layer a bad surprise). Fix that. Fixes scylladb/scylladb#26371 Closes scylladb/scylladb#26196	2025-10-06 15:33:11 +03:00
Michał Chojnowski	dbddba0794	sstables/trie: actually apply BYPASS CACHE to index reads BYPASS CACHE is implemented for `bti_index_reader` by giving it its own private `cached_file` wrappers over Partitions.db and Rows.db, instead of passing it the shared `cached_file` owned by the sstable. But due to an oversight, the private `cached_file`s aren't constructed on top of the raw Partitions.db and Rows.db files, but on top of `cached_file_impl` wrappers around those files. Which means that BYPASS CACHE doesn't actually do its job. Tests based on `scylla_index_page_cache_*` metrics and on CQL tracing still see the reads from the private files as "cache misses", but those misses are served from the shared cached files anyway, so the tests don't see the problem. In this commit we extend `test_bti_index.py` with a check that looks at reactor's `io_queue` metrics instead, and catches the problem. Fixes scylladb/scylladb#26372 Closes scylladb/scylladb#26373	2025-10-06 15:32:05 +03:00
Andrzej Jackowski	c3dd383e9e	test: add reproduction of name reuse bug to service level tests This commit adds a reproduction test for scylladb/scylladb#26190 to the service levels test suite. Although the bug was fixed internally in Seastar, the corner-case service level name reuse scenario should be covered by tests to prevent regressions. Refs: https://github.com/scylladb/scylladb/issues/26190 Closes scylladb/scylladb#26379	2025-10-06 14:19:22 +02:00
Piotr Dulikowski	380f243986	Merge ' Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names. For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] } Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs. Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks. Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported. New feature, no backport required. 
Co-authored with @bhalevy Fixes https://github.com/scylladb/scylladb/issues/25269 Fixes https://github.com/scylladb/scylladb/issues/23525 Closes scylladb/scylladb#26358 * github.com:scylladb/scylladb: tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count locator: Make hasher for endpoint_dc_rack globally accessible test: tablets: Add test for replica allocation on rack list changes test: lib: topology_builder: generate unique rack names test: Add tests for rack list RF doc: Document rack-list replication factor topology_coordinator: Restore formatting topology_coordinator: Cancel keyspace alter on broader set of errors topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() cql3: ks_prop_defs: Preserve old options cql3: ks_prop_defs: Introduce flattened() locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() tablet_allocator: Respect binding replicas to racks locator: network_topology_strategy: Respect rack list when reallocating tablets cql3: ks_prop_defs: Fail with more information when options are not in expected format locator, cql3: Support rack lists in replication options cql3: Fail early on vnode/tablet flavor alter cql3: Extract convert_property_map() out of Cql.g schema: Use definition from the header instead of open-coding it locator: Abstract obtaining the number of replicas from replication_strategy_config_option cql3, locator: Use type aliases for option maps locator: Add debug logging locator: Pass topology to replication strategy constructor abstract_replication_strategy, network_topology_strategy: add replication_factor_data class	2025-10-06 14:14:09 +02:00
Piotr Dulikowski	e7907b173a	Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
Closes scylladb/scylladb#25802 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-06 12:46:46 +02:00
Pavel Emelyanov	6ad8dc4a44	Merge 'root,replica: mv querier to replica/' from Botond Dénes The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people. The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear. Code cleanup, no backport. Closes scylladb/scylladb#26280 * github.com:scylladb/scylladb: replica: move querier code to replica namespace root,replica: mv querier to replica/	2025-10-06 08:26:05 +03:00
Pavel Emelyanov	5cf9043d74	Merge 'sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db' from Michał Chojnowski TemporaryHashes.db is a temporary sstable component used during ms sstable writes. It's different from other sstable components in that it's not included in the TOC. Because of this, it has a special case in the logic that deletes unfinished sstables on boot. (After Scylla dies in the middle of a sstable write). But there's a bug in that special case, which causes Scylla to forget to delete other components from the same unfinished sstable. The code intends only to delete the TemporaryHashes.db file from the `_state->generations_found` multimap, but it accidentally also deletes the file's sibling components from the multimap. Fix that. Also, extend a related test so that it would catch the problem before the fix. Fixes scylladb/scylladb#26393 Bugfix, needs backport to 2025.4. Closes scylladb/scylladb#26394 * github.com:scylladb/scylladb: sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db test/boost/database_test: fix two no-op distributed loader tests	2025-10-06 08:23:03 +03:00
Andrzej Jackowski	3411089f5d	treewide: seastar module update The reason for this seastar update is fixing #26190 - a service level bug caused by a problem in scheduling group in seastar implementation (seastar#2992). * ./seastar 9c07020a...270476e7 (10): > core: restore seastar_logger namespace in try_systemwide_memory_barrier > Merge 'coroutines: support coroutines that copy their captures into the coroutine frame' from Avi Kivity coroutines: advertise lambda-capture-by-value and test it future: invoke continuation functions as temporaries future: handle lvalue references in future continuations early > resource: Tune up some allocate_io_queues() arguments > Merge 'Add perf test hooks' from Travis Downs perf_tests:add tests to verify pre-run hooks per_tests: add pre-run hooks perf-tests.md: update on measurement overhead perf_tests_perf: a few more test variations remove vestigial register_test method > Add `touch` command to `rl` file processing > Merge 'execution_stage: update stage name on scheduling_group rename' from Andrzej Jackowski test: add sg_rename_recreate_with_the_same_name test: add test_renaming_execution_stage in metric_test test: add test_execution_stage_rename execution_stage: update stage name on scheduling_group rename execution_stage: reorganize per_group_stage_type execution_stage: add concrete_execution_stage_base execution_stage: move metrics setup to a separate method > iotune: Fix warmup calculation bug and botched rebase > Add missing `#pragma once` to ascii.rl > iotune: Ignore measurements during warmup period Fixes: https://github.com/scylladb/scylladb/issues/26190 Closes scylladb/scylladb#26388	2025-10-06 08:13:37 +03:00
Michał Chojnowski	6efb807c1a	sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db TemporaryHashes.db is a temporary sstable component used during ms sstable writes. It's different from other sstable components in that it's not included in the TOC. Because of this, it has a special case in the logic that deletes unfinished sstables on boot. (After Scylla dies in the middle of a sstable write). But there's a bug in that special case, which causes Scylla to forget to delete other components from the same unfinished sstable. The code intends only to delete the TemporaryHashes.db file from the `_state->generations_found` multimap, but it accidentally also deletes the file's sibling components from the multimap. Fix that. Fixes scylladb/scylladb#26393	2025-10-04 00:45:55 +02:00
Michał Chojnowski	16cb223d7f	test/boost/database_test: fix two no-op distributed loader tests There are two tests which effectively check nothing. They intend to check that the distributed loader removes "leftover" sstable files. So they create some incomplete sstables, run the test env on the directory, and check that the files disappeared. But the test env completely clears the test directory before the distributed loader looks at the files, so the tests succeed trivially. Fix that by adding a config knob to the test env which instructs it not to clear the directory before the test.	2025-10-04 00:44:49 +02:00
Michał Hudobski	3db2e67478	docs: adjust docs for VS auth changes We adjust the documentation to include the new VECTOR_SEARCH_INDEXING permission and its usage, and also to reflect the changes in the maximum number of service levels.	2025-10-03 16:55:57 +02:00
Michał Hudobski	e8fb745965	test: add tests for VECTOR_SEARCH_INDEXING permission This commit adds tests to verify the expected behavior of the VECTOR_SEARCH_INDEXING permission, that is, allowing GRANTing this permission only on ALL KEYSPACES and allowing SELECT queries only on tables with vector indexes when the user has this permission	2025-10-03 16:55:57 +02:00
Michał Hudobski	6a69bd770a	cql: allow VECTOR_SEARCH_INDEXING users to select This patch allows users with the VECTOR_SEARCH_INDEXING permission to perform SELECT queries on tables that have a vector index. This is needed for the Vector Store service, which reads the vector-indexed tables, but does not require the full SELECT permission.	2025-10-03 16:55:57 +02:00
Michał Hudobski	3025a35aa6	auth: add possibility to check for any permission in set This commit adds a new version of the command_desc struct that contains a set of permissions instead of a single permission. When this struct is passed to ensure/check_has_permission, we check if the user has any of the included permissions on the resource.	2025-10-03 16:55:57 +02:00
Michał Hudobski	ae86bfadac	auth: add a new permission VECTOR_SEARCH_INDEXING This patch adds a new permission: VECTOR_SEARCH_INDEXING, that is grantable only for ALL KEYSPACES. It will allow selecting from tables with vector search indexes. It is meant to be used by the Vector Store service to allow it to build indexes without having full SELECT permissions on the tables.	2025-10-03 16:36:54 +02:00
Ferenc Szili	20aeed1607	load balancing: extend locator::load_stats to collect tablet sizes This commit extends the TABLE_LOAD_STATS RPC with data about the tablet replica sizes and effective disk capacity. Effective disk capacity of a node is computed as the sum of the sizes of all tablet replicas on the node and the available disk space. This is the first change in the size-based load balancing series. Closes scylladb/scylladb#26035	2025-10-03 13:37:22 +02:00
Pavel Emelyanov	37f59cef04	Merge 'tools: fix documentation links after change to source-available' from Botond Dénes Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these. Fixes: scylladb/scylladb#26320 Broken documentation link fix for the tool help output, needs backport to all live source-available versions. Closes scylladb/scylladb#26322 * github.com:scylladb/scylladb: tools/scylla-sstable: fix doc links release: adjust doc_link() for the post source-available world tools/scylla-nodetool: remove trailing " from doc urls	2025-10-03 13:53:19 +03:00
Pavel Emelyanov	7116e7dac6	api: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:51:25 +03:00
Pavel Emelyanov	42657105a3	api: Coroutinize get_built_indexes handler code "While at it". It looks much simpler this way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:50:58 +03:00
Pavel Emelyanov	f77f9db96c	api: Remove system_keyspace ref from column_family API block This reference was only needed to facilitate get_built_indexes handler to work. Now it's gone and the sys.ks. reference is no longer needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:50:22 +03:00
Pavel Emelyanov	95b616d0e5	api: Move get_built_indexes from column_family to view_builder The handler effectively works with the view_builder and should be registered in the block that has this service captured. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:49:33 +03:00
Tomasz Grabiec	9ebdeb261f	tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count The old logic assumes that replicas are spread across the whole DC when determining how many tablets we need in order to have at least 10 tablets per shard. If replicas are actually confined to a subset of racks, that logic will come up with a count that is too high and overshoot the actual per-shard count in those racks. A similar problem occurs when scaling down the tablet count, when we try to keep the per-shard tablet count below the goal. It should be tracked per-rack rather than per-DC, since racks can differ in how loaded they are by RF if it's a rack-list.	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	6962464be7	locator: Make hasher for endpoint_dc_rack globally accessible	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	85ddb832b4	test: tablets: Add test for replica allocation on rack list changes	2025-10-02 19:45:00 +02:00
Benny Halevy	4955ca3ddd	test: lib: topology_builder: generate unique rack names Encode the dc identifier into each rack name so each dc will have its own unique racks. Just for easier distinction in logs. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	5fc617ecf5	test: Add tests for rack list RF	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	1d34614421	doc: Document rack-list replication factor Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	655f4ffa3c	topology_coordinator: Restore formatting	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	a21bbd4773	topology_coordinator: Cancel keyspace alter on broader set of errors We now include keyspace metadata construction, which can throw if validation fails. We want to fail the ALTER rather than keep retrying.	2025-10-02 19:44:59 +02:00
Tomasz Grabiec	d02f93e77e	topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() There are several problems with how ALTER execution works with tablets. 1) Currently, option processing bypasses ks_prop_defs::prepare_options() and passes the options directly to keyspace_metadata. This deviates from the vnode path, causing a discrepancy in logic. But also there will be some non-trivial options post-processing added there - numeric RF will be replaced with a rack list. We should preserve it in the tablet path which alters the keyspace, otherwise it will fail when trying to construct network_topology_strategy. 2) Option merging happens on the flat version of the map, which won't work correctly with the extended map, which contains lists. We want the new list to replace the old list or numeric RF, not get its items merged. For example: We want: {'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']} If we merge flattened options, we would get incorrect flattened options: {'dc1': 3, 'dc1:0': 'rack1', 'dc1:1': 'rack2'} 3) We lose atomicity of update. Validation and merging which happens on the CQL coordinator is done in a different group0 transaction context than mutation generation inside topology coordinator later. Fixes https://github.com/scylladb/scylladb/issues/25269	2025-10-02 19:44:29 +02:00
Tomasz Grabiec	849ab5545f	cql3: ks_prop_defs: Preserve old options In `2d9b8f2`, semantics of ALTER was changed for tablet-based keyspaces which makes "replication" assignment act like +=, where replication options are merged with the old options. This merging is currently performed in the CQL statement level on options map, before passing to topology coordinator. This will change in later commit, so move merging here. Merging options of flattened level will not be correct because it doesn't recognize nested collections, like rack lists. We want: {'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']} If we merge flattened options, we would get incorrect flattened options: {'dc1': 3, 'dc1:0', 'rack1' 'dc1:1', 'rack2'} Which cannot be parsed back into ks_prop_defs on the topology coordinator. Refs https://github.com/scylladb/scylladb/pull/20208#issuecomment-3174728061 Refs #25549	2025-10-02 19:42:39 +02:00
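The merge problem described in the two commits above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ks_prop_defs code: `flatten` and `merge_extended` are hypothetical helper names, and RF values are written as strings for simplicity.

```python
def flatten(options):
    """Flatten an extended map: list items become 'key:index' entries."""
    flat = {}
    for key, value in options.items():
        if isinstance(value, list):
            for index, rack in enumerate(value):
                flat[f"{key}:{index}"] = rack
        else:
            flat[key] = value
    return flat


def merge_extended(old, new):
    """Merge per key on the extended map: a new value replaces the old one."""
    merged = dict(old)
    merged.update(new)
    return merged


old = {"dc1": "3"}
new = {"dc1": ["rack1", "rack2"]}

# Merging before flattening: the rack list replaces the numeric RF.
good = flatten(merge_extended(old, new))
print(good)  # {'dc1:0': 'rack1', 'dc1:1': 'rack2'}

# Merging after flattening: the stale numeric RF survives next to the list.
bad = {**flatten(old), **flatten(new)}
print(bad)  # {'dc1': '3', 'dc1:0': 'rack1', 'dc1:1': 'rack2'}
```

The second result mixes a numeric RF and a rack list for the same DC, which is the inconsistent flattened form the commit messages warn about.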
Tomasz Grabiec	0d0c06da06	cql3: ks_prop_defs: Introduce flattened()	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	6b7b0cb628	locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	e5b7452af2	tablet_allocator: Respect binding replicas to racks	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	6de342ed3e	locator: network_topology_strategy: Respect rack list when reallocating tablets	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	8e9a58b89f	cql3: ks_prop_defs: Fail with more information when options are not in expected format Before, we would throw vague sstring_out_of_range from substr() when the name doesn't have a nested key separated by ":", e.g. "durable_writes" instead of "durable_writes:durable_writes".	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	66755db062	locator, cql3: Support rack lists in replication options Allows per-DC replication factor to be either a string, holding a numerical value, or a list of strings, holding a list of rack names. The rack list is not respected yet by the tablet allocator; that is done in a subsequent commit. This changes the format of options stored in the flattened map in system_schema.keyspaces#replication. Values which are rack lists are converted into multiple entries, with the list index appended to the key with ':' as the separator. For example, this extended map: { 'dc1': '3', 'dc2': ['rack1', 'rack2'] } is stored as a flattened map: { 'dc1': '3', 'dc2:0': 'rack1', 'dc2:1': 'rack2' } Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-02 19:42:39 +02:00
Botond Dénes	9310d3eff3	scylla-gdb.py: small-objects: don't cast free_object to void* When walking the free-list of a pool or a span, the small-object code casts the dereferenced `free_object` to `void*`. This is unnecessary, just use the `next` field of the `free_object` to look up the next free object. I think this monkey business with `void*` was done to speed up walking the free-list, but recently we've seen small-object --summarize fail in CI, and it could be related. Fixes: #25733 Closes scylladb/scylladb#26339	2025-10-02 13:36:49 +03:00
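The flattened storage format described in commit 66755db062 can be illustrated with a small Python round trip. This is a sketch under assumptions, not the actual C++ implementation: `flatten` and `unflatten` are hypothetical names, and the sketch assumes option keys contain no ':' other than the list-index separator.

```python
def flatten(options):
    """Convert an extended option map to the flat string-to-string form."""
    flat = {}
    for key, value in options.items():
        if isinstance(value, list):
            for index, rack in enumerate(value):
                flat[f"{key}:{index}"] = rack
        else:
            flat[key] = value
    return flat


def unflatten(flat):
    """Rebuild the extended map from 'key:index' entries."""
    options = {}
    lists = {}
    for key, value in flat.items():
        name, sep, index = key.rpartition(":")
        if sep and index.isdigit():
            # Collect list items by their numeric position.
            lists.setdefault(name, {})[int(index)] = value
        else:
            options[key] = value
    for name, items in lists.items():
        options[name] = [items[i] for i in sorted(items)]
    return options


extended = {"dc1": "3", "dc2": ["rack1", "rack2"]}
flat = flatten(extended)
print(flat)  # {'dc1': '3', 'dc2:0': 'rack1', 'dc2:1': 'rack2'}
print(unflatten(flat) == extended)  # True
```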
Botond Dénes	9d08a380db	Merge 'Fix getendpoints command for compound keys containing ':'' from Taras Veretilnyk Before, the `nodetool getendpoints` expected the key as one string separated by : (for example 1:val:ue). This caused errors if any part of the key had a colon because it was unclear whether a colon was a separator or part of the key. This change adds a new API endpoint, `/storage_service/natural_endpoints/v2/{keyspace}`, which accepts composite partition keys as multiple key_component query parameters (e.g., ?key_component=1&key_component=val:ue). The `nodetool getendpoints` command was updated to support a new `--key-components` option, allowing users to pass key components as an array. The client and test infrastructure were extended to support multiple values for a query parameter, and tests were added to verify correct behavior with composite keys. The previous method of passing partition keys as colon-separated strings is preserved for backward compatibility. Backport is not required, since this change relies on recent Seastar updates Fixes #16596 Closes scylladb/scylladb#26169 * github.com:scylladb/scylladb: docs: document --key-components option for getendpoints test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints nodetool: Introduce new option --key-components to specify compound partition keys as array rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':' rest_api_mock: support duplicate query parameters test/rest_api: support multiple query values per key in RestApiSession.send() nodetool: add support of new seastar query_parameters_type to scylla_rest_client	2025-10-02 09:04:40 +03:00
Aleksandra Martyniuk	0e73ce202e	test: wait for cql in test_two_tablets_concurrent_repair_and_migration_repair_writer_level In test_two_tablets_concurrent_repair_and_migration_repair_writer_level safe_rolling_restart returns ready cql. However, get_all_tablet_replicas uses the cql reference from manager that isn't ready. Wait for cql. Fixes: #26328 Closes scylladb/scylladb#26349	2025-10-02 06:41:36 +03:00
Avi Kivity	7230a04799	dht, sstables: replace vector with chunked_vector when computing sstable shards sstable::compute_shards_for_this_sstable() has a temporary of type std::vector<dht::token_range> (aka dht::partition_range_vector), which allocates a contiguous 300k when loading an sstable from disk. This causes large allocation warnings (it doesn't really stress the allocator since this typically happens during startup, but best to clear the warning anyway). Fix this by changing the container to be chunked_vector. It is passed to dht::ring_position_range_vector_sharder, but since we're the only user, we can change that class to accept the new type. Fixes #24198. Closes scylladb/scylladb#26353	2025-10-02 00:47:42 +02:00
Michał Jadwiszczak	d92628e3bd	test/cluster/test_view_building_coordinator: skip reproducer instead of xfail The reproducer for issue scylladb/scylladb#26244 takes some time and since the test is failing, there is no point in wasting resources on it. We can change the xfail mark to skip. Refs scylladb/scylladb#26244 Closes scylladb/scylladb#26350	2025-10-01 18:33:05 +02:00
Tomasz Grabiec	c5731221c0	cql3: Fail early on vnode/tablet flavor alter Some tests expect this error. Later, prepare_options() will be changed in a way which would fail to accept new options in such case before vnode/tablet flavor change is detected, tripping the tests.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	11b4a1ab58	cql3: Extract convert_property_map() out of Cql.g So that complex code is in a .cc file for better IDE assistance.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	b6df186e54	schema: Use definition from the header instead of open-coding it	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	726548b835	locator: Abstract obtaining the number of replicas from replication_strategy_config_option It will become more complex when options will contain rack lists. It's a good change regardless, as it reduces duplication and makes parsing uniform. We already diverged to use stoi / stol / stoul. The change in create_keyspace_statement.cc to add a catch clause is needed because get_replication_factor() now throws configuration_exception on parsing errors instead of std::invalid_argument, so the existing catch clause in the outer scope is not effective. That loop is trying to interpret all options as RF to run some validations. Not all options are RF, and those are supposed to be ignored.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	91e51a5dd1	cql3, locator: Use type aliases for option maps In preparation for changing their structure. 1) std::map<sstring, sstring> -> replication_strategy_config_options Parsed options. Values will become std::variant<sstring, rack_list> 2) std::map<sstring, sstring> -> property_definitions::map_type Flattened map of options, as stored in system tables.	2025-10-01 16:06:51 +02:00
Tomasz Grabiec	3c31e148c5	locator: Add debug logging	2025-10-01 16:06:28 +02:00
Benny Halevy	da6e2fdb1b	locator: Pass topology to replication strategy constructor	2025-10-01 16:06:28 +02:00
Benny Halevy	3965e29075	abstract_replication_strategy, network_topology_strategy: add replication_factor_data class Prepare for supporting also list of rack names. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-10-01 16:06:27 +02:00
Taras Veretilnyk	6d8224b726	docs: document --key-components option for getendpoints	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	6381c63d65	test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints Adds a parameterized test to verify that multiple --key-components arguments are handled correctly by nodetool's getendpoints command. Ensures the constructed REST request includes all key_component values in the expected format.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	78888dd76c	nodetool: Introduce new option --key-components to specify compound partition keys as array Allows getendpoints to accept components of partition key using the --key-components option. Key components are passed as an array and sent to the new /natural_endpoints/v2/{keyspace} endpoint.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	2456ebd7c2	rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components Adds a test case for the `/storage_service/natural_endpoints/v2/{keyspace}` endpoint, verifying that it correctly resolves natural endpoints for a composite partition key passed as multiple `key_component` query parameters.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	89d474ba59	api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':' The original `/storage_service/natural_endpoints` endpoint uses colon-separated strings for composite keys, which causes ambiguity when key components contain colons. This commit adds a new `/storage_service/natural_endpoints/v2/{keyspace}` endpoint that accepts partition key components via repeated `key_component` query parameters to avoid this issue.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	65ade28a9c	rest_api_mock: support duplicate query parameters Previously, only the last value of a repeated query parameter was captured, which could cause inaccurate request matching in tests. This update ensures that all values are preserved by storing duplicates as lists in the `params` dict.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	b60afeaa46	test/rest_api: support multiple query values per key in RestApiSession.send() Previously, the send() method in RestApiSession only supported one value per query parameter key. This patch updates it to support passing lists of values, allowing the same key to appear multiple times in the query string (e.g. ?key=value1&key=value2).	2025-10-01 15:53:25 +02:00
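The repeated-query-parameter convention used by the commits above can be shown with Python's standard library. This is a sketch, not the `RestApiSession.send()` or rest_api_mock implementation; it only illustrates how a list value expands into repeated keys and how the receiving side gets all values back as a list.

```python
from urllib.parse import parse_qs, urlencode

# One key with several values, as for a composite partition key.
params = {"key_component": ["1", "val:ue"]}

# doseq=True emits the key once per list item instead of encoding the list itself.
query = urlencode(params, doseq=True)
print(query)  # key_component=1&key_component=val%3Aue

# Parsing keeps every value, so a component containing ':' stays unambiguous.
print(parse_qs(query))  # {'key_component': ['1', 'val:ue']}
```

Note that `urlencode` percent-encodes the colon, so the component `val:ue` travels as `val%3Aue` and is decoded back on the receiving side.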
Taras Veretilnyk	53883958d6	nodetool: add support of new seastar query_parameters_type to scylla_rest_client	2025-10-01 15:52:18 +02:00
Avi Kivity	15fa1c1c7e	Merge 'sstables/trie: translate all key cells in one go, not lazily' from Michał Chojnowski Applying lazy evaluation to the BTI encoding of clustering keys was probably a bad default. The possible benefits are dubious (because it's quite likely that the laziness won't allow us to avoid that much work), but the overhead needed to implement the laziness is large and immediate. In this patch we get rid of the laziness. We rewrite lazy_comparable_bytes_from_clustering_position and lazy_comparable_bytes_from_ring_position so that they perform the key translation eagerly, all components to a single bytes_ostream in one synchronous call. perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type): ``` Before: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6 After: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9 ``` Enhancement, backport not required. Closes scylladb/scylladb#26302 * github.com:scylladb/scylladb: sstables/trie: BTI-translate the entire partition key at once sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset() sstables/trie: perform the BTI-encoding of position_in_partition eagerly types/comparable_bytes: add comparable_bytes_from_compound test/perf: add perf_bti_key_translation	2025-10-01 14:59:06 +03:00
Dawid Mędrek	b409e85c20	view: Stop requiring experimental feature We modify the requirements for using materialized views in tablet-based keyspaces. Before, it was necessary to enable the configuration option `rf_rack_valid_keyspaces`, to have the cluster feature `VIEWS_WITH_TABLETS` enabled, and to use the experimental feature `views-with-tablets`. We drop the last requirement. We adjust code to that change and provide a new validation test. We also update the user documentation to reflect the changes. Fixes scylladb/scylladb#23030	2025-10-01 09:01:53 +02:00
Dawid Mędrek	288be6c82d	db/view: Verify valid configuration for tablet-based views Creating a materialized view or a secondary index in a tablet-based keyspace requires that the user enabled two options: * experimental feature `views-with-tablets`, * configuration option `rf_rack_valid_keyspaces`. Because the latter has only become a necessity recently (in this series), it's possible that there are already existing materialized views that violate it. We add a new check at start-up that iterates over existing views and makes sure that that is not the case. Otherwise, Scylla notifies the user of the problem.	2025-10-01 09:01:53 +02:00
Dawid Mędrek	00222070cd	db/view: Require rf_rack_valid_keyspaces when creating view We extend the requirements for being able to create materialized views and secondary indexes in tablet-based keyspaces. It's now necessary to enable the configuration option `rf_rack_valid_keyspaces`. This is a stepping stone towards bringing materialized views and secondary indexes with tablets out of the experimental phase. We add a validation test to verify the changes. Refs scylladb/scylladb#23030	2025-10-01 09:01:50 +02:00
Dawid Mędrek	71606ffdda	test/cluster/random_failures: Skip creating secondary indexes Materialized views are going to require the configuration option `rf_rack_valid_keyspaces` when being created in tablet-based keyspaces. Since random-failure tests still haven't been adjusted to work with it, and because it's not trivial, we skip the cases when we end up creating or dropping an index.	2025-10-01 09:01:38 +02:00
Dawid Mędrek	6322b5996d	test/cluster/mv: Mark test_mv_rf_change as skipped The test will not work with `rf_rack_valid_keyspaces`. Since the option is going to become a requirement for using views with tablets, the test will need to be rewritten to take that into consideration. Since that adjustment doesn't seem trivial, we mark the test as skipped for the time being.	2025-10-01 09:01:29 +02:00
Botond Dénes	bdca5600ef	Merge 'Prevent stalls due to large tablet mutations' from Benny Halevy Currently, replica::tablet_map_to_mutation generates a mutation having a row per tablet. With enough tablets (10s of thousands) in the table we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954, and I assume we would see similar stalls also when converting those mutations into canonical_mutation and back, as they are similar to frozen_mutation, and a bit more expensive since they also save the column mappings. This series takes a different approach than allowing freeze to yield. `tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, able to generate multiple split mutations, that when squashed together are equivalent to the previously large mutation. Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add those mutations to a vector for further processing, and/or process them inline by freezing or making a canonical mutation. In addition, splitting the large mutations would also prevent hitting the commitlog maximum mutation size. Closes scylladb/scylladb#18162 * github.com:scylladb/scylladb: schema_tables: convert_schema_to_mutations: simplify check for system keyspace tablets: read_tablet_mutations: use unfreeze_and_split_gently storage_service: merge_topology_snapshot: freeze snp.mutations gently mutation: async_utils: add unfreeze_and_split_gently mutation: add for_each_split_mutation tablets: tablet_map_to_mutations: maybe split tablets mutation tablets: tablet_map_to_mutations: accept process_func perf-tablets: change default tables and tablets-per-table perf-tablets: abort on unhandled exception	2025-10-01 07:04:09 +03:00
Ernest Zaslavsky	043d2dfb30	treewide: seastar module update Seastar module update ``` 9c07020a Merge 'http: Introduce retry strategy machinery for http client (take two)' from Ernest Zaslavsky 58404b81 http: check for abort at start of `make_request` 35a9e086 http: support per-call `retry_strategy` in `make_request` 96538b92 http: integrate `retry_strategy` into HTTP client 77c3ba14 http: initial implementation of `retry_strategy` b9b9e7bf memory: Call finish_allocation() at the end of allocate_aligned() 2052c200 Merge 'file: coroutinize some functions' from Avi Kivity 7b65e50c file: reindent after coroutinization 837f64b5 file: coroutinize dma_read_impl() 9220607b file: coroutinize dma_read_exactly_impl() d1414541 file: coroutinize set_lifetime_hint_impl() 94d8fd08 file: coroutinize get_lifetime_hint_impl() 392efff4 file: coroutinize maybe_read_eof() e68a3173 file: do_dma_read_bulk: remove "rstate" local 14ac42cd file: coroutinize do_dma_read_bulk() 5446cbab net: Use future::then_unpack() helper to unpack tuples 9e88c4d8 posix-stack: Initialize unique_ptr-s with new result directly 51fb302e rpc: connection::process() use structured binding be2c2b54 http: Explicitly deprecate request::content ``` Closes scylladb/scylladb#26342	2025-10-01 06:44:31 +03:00
Jenkins Promoter	f8c02a420d	Update pgo profiles - aarch64	2025-10-01 05:32:35 +03:00
Jenkins Promoter	b45a57f65e	Update pgo profiles - x86_64	2025-10-01 04:54:14 +03:00
Dawid Mędrek	994f09530f	test/cluster: Adjust MV tests to RF-rack-validity Some of the new tests covering materialized views explicitly disabled the configuration option `rf_rack_valid_keyspaces`. It's going to become a new requirement for views with tablets, so we adjust those tests and enable the option. There is one exception, the test: `cluster/mv/test_mv_topology_change.py::test_mv_rf_change` We handle it separately in the following commit.	2025-09-30 20:01:25 +02:00
Luis Freitas	884c584faf	Update ScyllaDB version to: 2026.1.0-dev	2025-09-30 18:54:09 +03:00
Benny Halevy	1ceb49f6c1	schema_tables: convert_schema_to_mutations: simplify check for system keyspace Currently, the function unfreezes each schema mutation partition and then checks if it's for a system keyspace. This isn't really needed since we can check the partition key using the frozen_mutation, skip it if the partition is for a system keyspace. Note that the constructed partition_key just copies the frozen partition_key_view, without copying or deserializing the actual key contents. Also, reserve `results` capacity using the queried partitions' size to prevent reallocations of the results vector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	b17a36c071	tablets: read_tablet_mutations: use unfreeze_and_split_gently Split the tablets mutations by number of rows, based on `min_tablets_in_mutation` (currently calibrated to 1024), similar to the splitting done in `storage_service::merge_topology_snapshot`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	25b68fd211	storage_service: merge_topology_snapshot: freeze snp.mutations gently We don't need to store all snp.mutations in a vector and then freeze the whole vector. They can be frozen one at a time and collected into a vector, while maybe yielding between each mutation to prevent stalls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	fd38cfaf69	mutation: async_utils: add unfreeze_and_split_gently Unfreeze the frozen_mutation, possibly splitting it based on max_rows. The process_mutation function is called for each split mutation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	faa0ee9844	mutation: add for_each_split_mutation Allows processing of the split mutations one at a time. This can reduce memory footprint as the caller won't have to store a vector of the split mutations and then convert it (e.g. freeze the mutations or convert them to canonical mutations). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	d21984d0cc	tablets: tablet_map_to_mutations: maybe split tablets mutation Split the generated tablets mutation if we run out of task quota to prevent stalls, both when preparing the mutations and later on when freezing/unfreezing them or converting them to canonical_mutation and back. Note that this will convert a large mutation into a long vector of mutations. A followup change is considered to convert std::vector:s of mutations to chunked_vector to prevent large allocations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	aaddff5211	tablets: tablet_map_to_mutations: accept process_func Prepare for generating several mutations for the tablet_map by calling process_func for each generated mutation. This allows the caller to directly freeze those mutations one at a time into a vector of frozen mutations or similarly convert them into canonical mutations. Next patch will split large tablet mutations to prevent stalls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:38 +03:00
Avi Kivity	5d1846d783	dist: scylla_raid_setup: don't override XFS block size on modern kernels In `6977064693` ("dist: scylla_raid_setup: reduce xfs block size to 1k"), we reduced the XFS block size to 1k when possible. This is because commitlog wants to write the smallest amount of padding it can, and older Linux could only write a multiple of the block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than a filesystem block. However, this doesn't play well with some SSDs that have 512 byte logical sector size and 4096 byte physical sector size - it causes them to issue read-modify-writes. To improve the situation, if we detect that the kernel is recent enough, format the filesystem with its default block size, which should be optimal. Note that commitlog will still issue sub-4k writes, which can translate to RMW. There, we believe that the amplification is reduced since sequential sub-physical-sector writes can be merged, and that the overhead from commitlog space amplification is worse than the RMW overhead. Tested on AWS i4i.large. 
fsqual report: ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 4096 context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` The sub-block overwrite cases are GOOD. 
In comparison, the fsqual report for 1k (similar): ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 1024 context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` Fixes #25441. [1] `ed1128c2d0` Closes scylladb/scylladb#25445	2025-09-30 17:14:36 +03:00
Benny Halevy	3c07e0e877	perf-tablets: change default tables and tablets-per-table tablets-per-table must be a power of 2, so round up 10000 to 16K. also, reduce number of tables to have a total of about 100K tablets, otherwise we hit the maximum commitlog mutation size limit in save_tablet_metadata. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:07:06 +03:00
Benny Halevy	2c3fb341e9	perf-tablets: abort on unhandled exception Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:07:06 +03:00
Nadav Har'El	926089746b	message: move RPC compression from utils/ to message/ The directory utils/ is supposed to contain general-purpose utility classes and functions, which are either already used across the project, or are designed to be used across the project. This patch moves 8 files out of utils/: utils/advanced_rpc_compressor.hh utils/advanced_rpc_compressor.cc utils/advanced_rpc_compressor_protocol.hh utils/stream_compressor.hh utils/stream_compressor.cc utils/dict_trainer.cc utils/dict_trainer.hh utils/shared_dict.hh These 8 files together implement the compression feature of RPC. None of them are used by any other Scylla component (e.g., sstables have a different compression), or are ready to be used by another component, so this patch moves all of them into message/, where RPC is implemented. Theoretically, we may want in the future to use this cluster of classes for some other component, but even then, we shouldn't just have these files individually in utils/ - these are not useful stand-alone utilities. One cannot use "shared_dict.hh" assuming it is some sort of general-purpose shared hash table or something - it is completely specific to compression and zstd, and specifically to its use in those other classes. Beyond moving these 8 files, this patch also contains changes to: 1. Fix includes to the 5 moved header files (.hh). 2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt for the three moved source files (.cc). 3. In the moved files, change from the "utils::" namespace, to the "netw::" namespace used by RPC. Also needed to change a bunch of callers for the new namespace. Also, had to add "utils::" explicitly in several places which previously assumed the current namespace is "utils::". Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25149	2025-09-30 17:03:09 +03:00
Pavel Emelyanov	269aaee1b4	Merge 'test: dtest: test_limits.py: migrate from dtest' from Dario Mirovic This PR migrates limits tests from dtest to this repository. One reason is that there is an ongoing effort to migrate tests from dtest to here. Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs. Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time. scylla-dtest PR that removes migrated tests: [limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232) Fixes #25097 This is a migration of existing tests to this repository. No need for backport. Closes scylladb/scylladb#26077 * github.com:scylladb/scylladb: test: dtest: limits_test.py: test_max_cells log level test: dtest: limits_test.py: make the tests work test: dtest: test_limits.py: remove test that are not being migrated test: dtest: copy unmodified limits_test.py	2025-09-30 16:57:32 +03:00
Piotr Dulikowski	5e5a3c7ec5	view_building_worker.cc: fix spelling (commiting -> committing) The typo is reported by GitHub action on each PR, so let's fix it to reduce the noise for everybody. Closes scylladb/scylladb#26329	2025-09-30 16:47:03 +03:00
Emil Maskovsky	b0de054439	docs: fix typos and spelling errors Corrected spelling mistakes, typos, and minor wording issues to improve the developer documentation. No backport: There is no functional change, and the doc is mostly relevant to master, so it doesn't need to be backported. Closes scylladb/scylladb#26332	2025-09-30 13:16:49 +02:00
Avi Kivity	72609b5f69	Merge 'mv: generate view updates on pending replica' from Michael Litvak Generate view updates from a pending base replica if it's a reading replica, i.e. it's in the last stage of transition write_both_read_new before becoming the new base replica. Previously we didn't generate view updates on a pending replica. The problem with that is that when a base token is migrated from one replica B1 to another B2, at one stage we generate view updates only from B1, then at the next stage we generate view updates only from B2. During this transition, it can happen that for some write neither B1 nor B2 generate view update, because each one sees the other as the base replica. We fix this by generating view updates from both base replicas in the phase before the transition. We can generate view updates on the pending replica in this case, even if it requires read-before-write, because it's in a stage where it contains all data and serves reads. Fixes https://github.com/scylladb/scylladb/issues/24292 backport not needed - the issue mostly affects MV with tablets which is still experimental Closes scylladb/scylladb#25904 * github.com:scylladb/scylladb: test: mv: test view update during topology operations mv: generate view updates on both shards in intranode migration mv: generate view updates on pending replica	2025-09-30 13:17:16 +03:00
Piotr Wieczorek	4be0bdbc07	alternator: Don't emit a redundant REMOVE event in Alternator Streams for PutItem calls Until now, every PutItem operation appeared in the Alternator Streams as two events - a REMOVE and a MODIFY. DynamoDB Streams emits only INSERT or MODIFY, depending on whether a row was replaced, or created anew. A related issue scylladb#6918 concerns distinguishing the mutation type properly. This was because each call to PutItem emitted the two CDC rows, returned by GetRecords. Since this patch, we use a collection tombstone for the `:attrs` column, and a separate tombstone for each regular column in the table's schema. We don't expect that new tables would have any other regular column, except for the `:attrs` and keys, but we may encounter them in upgraded tables which had old GSIs or LSIs. Fixes: scylladb#6930. Closes scylladb/scylladb#24991	2025-09-30 13:12:16 +03:00
Michał Hudobski	ae4d4908ba	configure: increase SCHEDULING_GROUPS_COUNT to 20 We would like to have an additional service level available for users of the Vector Store service, which would allow us to de/prioritize vector operations as needed. To allow that, we increase the number of scheduling groups from 19 to 20 and adjust the related test accordingly. Closes scylladb/scylladb#26316	2025-09-30 12:41:28 +03:00
Nadav Har'El	38002718a9	cqlpy: improve testing for "duration" column type We had very rudimentary tests for the "duration" CQL type in the cqlpy framework - just for reproducing issue #8001. But we left two alternative formats, and a lot of corner cases, untested. So this patch aims to add the missing tests - to exhaustively cover the "duration" literal formats and their capabilities. Some of the examples tested in the new test are inspired by Cassandra's unit test test/unit/org/apache/cassandra/cql3/DurationTest.java and the corner cases that this file covers. However, the new tests are not a direct translation of that file because DurationTest.java was not a CQL test - it was a unit test of Cassandra's internal "Duration" type, so could not be directly translated into a CQL-based test. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25092	2025-09-30 12:25:02 +03:00
Botond Dénes	efd99bb0af	Merge 'Return tablet ranges from range_to_endpoint_map API' from Pavel Emelyanov The handler in question when called for tablets-enabled keyspace, returns ranges that are inconsistent with those from system.tablets. Like this: system.tablets: ``` TabletReplicas(last_token=-4611686018427387905, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 0)]) TabletReplicas(last_token=-1, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)]) TabletReplicas(last_token=4611686018427387903, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 1), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)]) TabletReplicas(last_token=9223372036854775807, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 1), ('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0)]) ``` range_to_endpoint_map: ``` {'key': ['-9069053676502949657', '-8925522303269734226'], 'value': ['127.110.40.2', '127.110.40.3']} {'key': ['-8925522303269734226', '-8868737574445419305'], 'value': ['127.110.40.2', '127.110.40.3']} ... {'key': ['-337928553869203886', '-288500562444694340'], 'value': ['127.110.40.1', '127.110.40.3']} {'key': ['-288500562444694340', '105026475358661740'], 'value': ['127.110.40.1', '127.110.40.3']} {'key': ['105026475358661740', '611365860935890281'], 'value': ['127.110.40.1', '127.110.40.3']} ... {'key': ['8307064440200319556', '9117218379311179096'], 'value': ['127.110.40.2', '127.110.40.1']} {'key': ['9117218379311179096', '9125431458286674075'], 'value': ['127.110.40.2', '127.110.40.1']} ``` Not only the number of ranges differs, but also separating tokens do not match (e.g. tokens -2 and 0 belong to different tablets according to system.tablets, but fall into the same "range" in the API result). The source of confusion is that despite storage_service::get_range_to_address_map() is given correct e.r.m. pointer from the table, it still uses token_metadata::sorted_token() to work with. 
The fix is -- when the e.r.m. is per-table, the tokens should be taken from token_metadata's tablet_map (e.g. compare this to storage_service::effective_ownership() -- it grabs tokens differently for vnodes/tables cases). This PR fixes the mentioned problem and adds a validation test. The test also checks the /storage_service/describe_ring endpoint, which happens to return the correct set of values. The API is very ancient, so the bug is present in all versions with tablets Fixes #26331 Closes scylladb/scylladb#26231 * github.com:scylladb/scylladb: test: Add validation of data returned by /storage_service endpoints test,lib: Add range_to_endpoint_map() method to rest client api: Indentation fix after previous patches storage_service: Get tablet tokens if e.r.m. is per-table storage_service,api: Get e.r.m. inside get_range_to_address_map() storage_service: Calculate tokens on stack	2025-09-30 11:20:35 +03:00
Michał Jadwiszczak	3bbbbf419b	test/cluster/test_view_building_coordinator: add reproducer for staging sstables with tablet merge The test verifies if staging sstables are processed correctly after tablet merge. Refs scylladb/scylladb#26244 Closes scylladb/scylladb#26286	2025-09-30 09:05:31 +02:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. 
It can be enabled by setting `sstable_format: ms` in the config. Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`. This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row. `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 70 18.0 18 128 500001 1 1 647 19.0 19 132 0 1000000 1 748 15.0 15 116 0 500000 2 372 29.0 29 284 0 250000 4 227 56.0 56 504 0 125000 8 116 106.0 106 928 0 62500 16 67 195.0 195 1732 ``` `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 51 5.1 5 20 500001 1 1 64 5.3 5 20 0 1000000 1 679 4.0 4 16 0 500000 2 492 8.0 8 88 0 250000 4 804 16.0 16 232 0 125000 8 409 31.0 31 516 0 62500 16 97 54.0 54 1056 ``` Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`. 
External tests: I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. 
sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Piotr Dulikowski	4581c72430	Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev `SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables. We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now. Fixes https://github.com/scylladb/scylladb/issues/26258 backports: not needed (a new feature) Closes scylladb/scylladb#26284 * github.com:scylladb/scylladb: cql_test_env.cc: log exception when callback throws lwt: prohibit for tablet-based views and cdc logs tablets: disallow chains of colocated tables database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda	2025-09-30 07:15:16 +02:00
Michał Chojnowski	771a82969e	test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes Adds a test for the bloom filter rebuild mechanism in `ms` sstables.	2025-09-29 22:15:26 +02:00
Michał Chojnowski	c1e6cd58fa	test/cluster: add test_bti_index.py Add a test which checks that `ms` sstables can be enabled and disabled.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	7a2dc9cbfa	test: prepare bypass_cache_test.py for `ms` sstables The test looks at metrics to confirm whether queries hit the row cache, the index cache or the disk, depending on various settings. BIG index readers use a two level, read-through index cache, where the higher layer stores parsed "index pages" of Index.db, while the lower layer is a cache of raw 4 kiB file pages of Index.db. Therefore, if we want to count index cache hits, the appropriate metric to check in this case is `scylla_sstables_index_page_hits`, which counts hits in the higher layer. This is done by the test. However, BTI index readers don't have an equivalent of the higher cache layer. Their cache only stores the raw 4 kiB pages, and the hits are counted in `scylla_sstables_index_page_cache_hits`. (The same metric is incremented by the lower layer of the BIG index cache). Before this commit, the test would fail with `ms` sstables, because their reads don't increment `scylla_sstables_index_page_hits`. In this commit we adapt the test so that it instead checks `scylla_sstables_index_page_cache_hits` for `ms` sstables.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	621bfbe6d9	sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test_column_family.py::test_sstables_by_key_reader_closed injects a failure into `index_reader::advance_lower_and_check_if_present`. To preserve this test when BTI indexes are made the default, we have to add a corresponding error injection to `bti_index_reader::advance_lower_and_check_if_present`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	182c8ce87b	test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables BIG sstables and BTI sstables use different code paths for validating the Data file against the index. So we want to test both types of indexes, not just the default one. This patch changes the test so that it explicitly tests both `me` and `ms` instead of only testing the default format. Note that we disable some tests for BTI indexes: the tests which check that validation detects mismatches between the row index ("promoted index") and the Data file. This is because iteration over the row index in BTI isn't implemented at the moment, so for BTI the validation behaves as if there were no row indexes.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	aed1cb6f65	tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` A useful option in general, and I'll need it to test multiple versions in `test_sstable_validation.py`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	ef11dc57c1	db/config: expose "ms" format to the users via database config Extend the `sstable_format` config enum with a "ms" value, and, if it's enabled (in the config and in cluster features), use it for new sstables on the node. (Before this commit, writing `ms` sstables should only be possible in unit tests, via internal APIs. After this commit, the format can be enabled in the config and the database will write it during normal operation). As of this commit, the new format is not the default yet. (But it will become the default in a later commit in the same series).	2025-09-29 22:15:25 +02:00
Michał Chojnowski	2ed2033224	test: in Python tests, prepare some sstable filename regexes for `ms`	2025-09-29 22:15:25 +02:00
Michał Chojnowski	fe9f5f4da2	sstables: add `ms` to `all_sstable_versions` Add `ms` to the lists of sstable formats. This will cause it to be included in various unit tests.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	9155eeed10	test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests Add `ms` to tests which already test many format versions. The tests check that sstable files in newer versions are the same as in `mc`. Arbitrarily, for `ms`, we only check the files common between `mc` and `ms`. If we want to extend this test more, so that it checks that `Partitions.db` and `Rows.db` don't change over time, we have to add `ms` versions of all the sstables under `test/resources` which are used in this test. We won't do that in this patch series. And I'm not sure if we want to do that at all.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	70a6c9481b	test/lib/index_reader_assertions: skip some row index checks for BTI indexes Block monotonicity checks can't be implemented for BTI row indexes because they don't store full clustering positions, only some encoded prefixes. The emptiness check could be implemented with some effort, but we currently don't bother. The two tests which use this `is_empty()` method aren't very useful anyway. (They check that the promoted index is empty when there are no clustering keys. That doesn't really need a dedicated test).	2025-09-29 22:15:25 +02:00
Michał Chojnowski	d53f362328	test/boost/sstable_inexact_index_test: explicitly use a `me` sstable The test currently implicitly uses the default sstable format. But it assumes that the index reader type is `sstables::index_reader`, and it wants some methods specific to that type (and absent from the base `abstract_index_reader`). If we switch the default format from `me` to `ms`, without doing something about this, this test will start failing on the downcast to `sstables::index_reader`. We deal with this by explicitly specifying `me`. `me` and `ms` data readers are identical. And this is a test of the data reader, not the index reader. So it's perfectly fine to just use `me`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	fca56cb458	test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables This is an old test for some workaround for incorrectly-generated promoted indexes. It doesn't make sense to port this test to newer sstable formats. So just skip it for the new sstable versions.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	1f7882526b	test/resource: add `ms` sample sstable files for relevant tests There are some tests which want sstables of all format versions in `test/resource`. This commit adds `ms` files for those tests. I didn't think much about this change, I just mechanically generated the `ms` files from the existing `me` sstables in the same directories (using `scylla sstable upgrade`) for the tests which were complaining about the lack of `ms` files.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	6143dce3db	test/boost/sstable_compaction_test: prepare for `ms` sstables. Fix incompatibilities between the test's assumptions and the upcoming addition of `ms` sstables. Refer to individual tests for comments.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	622149a183	test/boost/index_reader_test: prepare for `ms` sstables Adjust the incompatibilities between the test and the upcoming `ms` sstables. Refer to individual tests for comments.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	a67d10d15d	test/boost/bloom_filter_tests: prepare for `ms` sstables The test for the bloom filter rebuild mechanism has to be adjusted, because `ms` sstables won't use this mechanism.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	312423fe53	test/boost/sstable_datafile_test: prepare for `ms` sstables The tests touched in this commit are concerned specifically with Summary. They are not applicable to sstables with BTI indexes.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	924b8eec11	test/boost/sstable_test: prepare for `ms` sstables. Skip `ms` sstables in an uninteresting test which relies on `sstables::index_reader`.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	db4283b542	sstables: introduce `ms` sstable format version Introduce `ms` -- a new sstable format version which is a hybrid of Cassandra's `me` and `da`. It is based on `me`, but with the index components (Summary.db and Index.db) replaced with the index components of `da` (Partitions.db and Rows.db). As of this patch, the version is never chosen anywhere for writing sstables yet. It is only introduced. We will add it to unit tests in a later commit, and expose it to users in yet later commit.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	17085dc1e4	tools/scylla-sstable: default to "preferred" sstable version, not "highest" Later in this patch series we will introduce `ms` as the new highest format, but we won't be able to make it the default within the same series due to some dtest incompatibilities. Until `ms` is the default, we don't want `scylla sstable` to default to it, even though it's the highest. Let's choose the default version in `scylla sstable` using the same method which is used by Scylla in general: by letting the `sstable_manager` choose.	2025-09-29 22:13:59 +02:00
Nadav Har'El	3a5475afb7	Merge 'metrics, vector search: add metrics to the vector store client' from Michał Hudobski This PR adds metrics to the vector store client, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/86245395/Vector+Store+Core+APIs#Metrics: - number of the dns refreshes We would like to track the dns refreshes to see if the network client is working properly. Here is the added metric: \# HELP scylla_vector_store_dns_refreshes Number of DNS refreshes \# TYPE scylla_vector_store_dns_refreshes gauge scylla_vector_store_dns_refreshes{shard="0"} 1.000000 Fixes: VECTOR-68 Closes scylladb/scylladb#25288 * github.com:scylladb/scylladb: metrics, test: added a test case for vs metrics metrics, vector_search: add a dns refresh metric vector_search: move the ann implementation to impl	2025-09-29 22:31:03 +03:00
Petr Gusev	29f9c355ab	cql_test_env.cc: log exception when callback throws When a test fails inside a do_with_cql_env callback, the logs don’t make it clear where the failure happened. This is because cql_env immediately begins shutting down services, which obscures the original failure.	2025-09-29 17:53:36 +02:00
Lakshmi Narayanan Sreethar	7b97928152	cmake: link `vector_search` to `test-lib` instead of `cql3` PR #26237 fixed linker errors by linking `cql3` to `vector_search` but this introduced a circular dependency between these two static libraries, sometimes causing failures during compilation: ``` ninja: error: dependency cycle: /home/user/Development/scylladb/build/debug/cql3/CqlParser.hpp -> data_dictionary/libdata_dictionary.a -> data_dictionary/CMakeFiles/data_dictionary.dir/data_dictionary.cc.o -> /home/user/Development/scylladb/build/debug/cql3/CqlParser.hpp ``` So, instead of linking the `vector_search` library to the `cql3` library, link it directly to the executable where the `cql3` library is also to be linked. For the test cases, this means linking `vector_search` to the `test-lib` library. Since both `vector_search` and `cql3` are static libraries, the linker will resolve them correctly regardless of the order in which they are linked. Refs #26235 Refs #26237 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26318	2025-09-29 17:46:58 +03:00
Botond Dénes	fe73c90df9	tools/scylla-sstable: fix doc links The doc links in scylla-sstable help output are static, so they always point to the documentation of the latest stable release, not to the documentation of the release the tool binary is from. On top of that, the links point to old open-source documentation, which is now EOL. Fix both problems: point link at the new source-available documentation pages and make them version aware.	2025-09-29 17:34:37 +03:00
Botond Dénes	15a4a9936b	release: adjust doc_link() for the post source-available world There is no more separate enterprise product and the doc urls are slightly different.	2025-09-29 17:02:55 +03:00
Botond Dénes	5a69838d06	tools/scylla-nodetool: remove trailing " from doc urls They are accidental leftover from a previous way of storing command descriptions.	2025-09-29 17:02:40 +03:00
Benny Halevy	b81c6a339b	test_tablets_merge: test_tablet_split_merge_with_many_tables: reduce number of tables in debug mode As the test hits timeouts in debug mode on aarch64. Fixes #26252 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26303	2025-09-29 15:30:13 +03:00
Michael Litvak	d94c1f6674	test: mv: test view update during topology operations add new test cases checking view consistency when writing to a table with MV and generating view updates while data is migrated. one case has tablet migrations while writing to the table. The other case does the equivalent for vnode keyspaces - it adds a new node. The tests reproduce issue scylladb/scylladb#24292	2025-09-29 13:44:04 +02:00
Michael Litvak	c9237bf5f6	mv: generate view updates on both shards in intranode migration Similarly to the issue of tokens migrating from one host to another, where we need to generate view updates on both replicas before transitioning in order to not lose view updates, we need to do the same in case of intranode migration. In intranode migration we migrate tokens from one shard to another. Previously we checked shard_for_reads in order to generate view updates only on the single shard that is selected for reads, and not on a pending shard that is not ready yet. The problem is that shard_for_reads switches from the source shard to the destination shard in a single transition, and during that switch we can lose view updates because neither shard sees itself as the shard for reads. We fix this by having a phase before the transition when both shards are ready for reads and both will generate view updates.	2025-09-29 13:44:04 +02:00
Michael Litvak	d842ea2dc9	mv: generate view updates on pending replica Generate view updates from a pending base replica if it's a reading replica, i.e. it's in the last stage of transition write_both_read_new before becoming the new base replica. Previously we didn't generate view updates on a pending replica. The problem with that is that when a base token is migrated from one replica B1 to another B2, at one stage we generate view updates only from B1, then at the next stage we generate view updates only from B2. During this transition, it can happen that for some write neither B1 nor B2 generate view update, because each one sees the other as the base replica. We fix this by generating view updates from both base replicas in the phase before the transition. We can generate view updates on the pending replica in this case, even if it requires read-before-write, because it's in a stage where it contains all data and serves reads. Fixes scylladb/scylladb#24292	2025-09-29 13:44:04 +02:00
Botond Dénes	7a773da425	Merge 'Speed up test cluster/test_alternator::test_localnodes_joining_nodes' from Nadav Har'El Before this patch, the test `cluster/test_alternator::test_localnodes_joining_nodes` was one of the slowest tests in the test/cluster framework, taking over two minutes to run. As comments in the test already acknowledged, there was no good reason why this test had to be so slow. The test needed to, intentionally, boot a server which took a long time (2 minutes) to fail its boot. But it didn't really need to wait for this failure - the right thing to do was to just kill the server at the end of the test. But we just didn't have the test-framework API to do it. So in this series, the first patch introduces the missing API, and the second patch uses it to fix test_localnodes_joining_nodes to kill the (unsuccessfully) booting server. After this patch, the test takes just 7 seconds to run. This is a test speedup only, so no real need to backport it - old releases get fewer test runs anyway, and the latency of these runs is less important. Closes scylladb/scylladb#25312 * github.com:scylladb/scylladb: test/cluster: greatly speed up test_localnodes_joining_nodes test/pylib: add the ability to stop currently-starting servers	2025-09-29 14:34:34 +03:00
Dawid Mędrek	d6fcd18540	test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces The test cases in the file aren't run via an existing interface like `do_with_cql_env`, but they rely on a more direct approach -- calling one of the schema loader tools. Because of that, they manage the `db::config` object on their own and don't enable the configuration option `rf_rack_valid_keyspaces`. That hasn't been a problem so far since the test doesn't attempt to create RF-rack-invalid keyspaces anyway. However, in an upcoming commit, we're going to further restrict views with tablets and require that the option is enabled. To prepare for that, we enable the option in all test cases. It's only necessary in a small subset of them, but it won't hurt to enforce it everywhere, so let's do that. Refs scylladb/scylladb#23958	2025-09-29 13:07:08 +02:00
Dawid Mędrek	a1254fb6f3	db/view: Name requirement for views with tablets We add a named requirement, a function, for materialized views with tablets. It decides whether we can create views and secondary indexes in a given keyspace. It's a stepping stone towards modifying the requirements for it. This way, we keep the code in one place, so it's not possible to forget to modify it somewhere. It also makes it more organized and concise.	2025-09-29 13:07:08 +02:00
Michał Chojnowski	4ca215abbc	sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader Partitions.db uses a piece of the murmur hash of the partition key internally. The same hash is used to query the bloom filter. So to avoid computing the hash twice (which involves converting the key into a hashable linearized form) it would make sense to use the same `hashed_key` for both purposes. This is what we do in this patch. We extract the computation of the `hashed_key` from `make_pk_filter` up to its parent `sstable_set_impl::create_single_key_sstable_reader`, and we pass this hash down both to `make_pk_filter` and to the sstable reader. (And we add a pointer to the `hashed_key` as a parameter to all functions along the way, to propagate it). The number of parameters to `mx::make_reader` is getting uncomfortable. Maybe they should be packed into some structs.	2025-09-29 13:01:22 +02:00
Michał Chojnowski	420e215873	sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db reader computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is usually also computed for bloom filter purposes just before that. So in this patch we let the caller pass that hash instead. The old index interface, without the hash, is kept for convenience. In this patch we only add a new interface, we don't switch the callers to it yet. That will happen in the next commit.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cee4011e7a	sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db writer computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is also computed for bloom filter purposes just before that, in the owning sstable writer. So in this patch we let the caller pass that hash here instead.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f8e3d5e7c2	sstables/mx: make Index and Summary components optional In previous patches we (hopefully) modified all users of Index and Summary components so that they no longer need those components to exist. (And can use Partitions and Rows components instead).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f003cbce6d	sstables: open Partitions.db early when it's needed to populate key range for sharding metadata If there's no metadata file with sharding metadata, the owning shards of an sstable are computed based on the partition key range within the sstable. This range is set in `set_first_and_last_keys()`, which (since another commit in this commit series) reads it either from the Summary component or from the footer of the Partitions component, whichever is available. But in some code paths `set_first_and_last_keys()` is called before the footer of Partitions is loaded. If the sstable doesn't have Summary, only Partitions, then the `set_first_and_last_keys()` will fail. To prevent that, in those cases we have to open the file and read its footer early, before the `set_first_and_last_keys()` calls. Note: the changes in this commit shouldn't matter during normal operation, in which a Scylla component with sharding metadata is available. But it might be used when old and/or incomplete sstables are read.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	4bdf5ca0cf	sstables: adapt sstable::set_first_and_last_keys to sstables without Summary `sstable::set_first_and_last_keys` currently takes the first and last key from the Summary component. But if only BTI indexes are used, this component will be nonexistent. In this case, we can use the first and last keys written in the footer of Partitions.db.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	b1984d6798	sstables: implement an alternative way to rebuild bloom filters for sstables without Index For efficiency, the cardinality of the bloom filter (i.e. the number of partition keys which will be written into the sstable) has to be known before elements are inserted into the filter. In some cases (e.g. memtables flush) this number is known exactly. But in others (e.g. repair) it can only be estimated, and the estimation might be very wrong, leading to an oversized filter. Because of that, some time ago we added a piece of logic (ran after the sstable is written, but before it's sealed) which looks at the actual number of written partitions, compares it to the initial estimate (on which the size of the bloom filter was based on), and if the difference is unacceptably large, it rewrites the bloom filter from partition keys contained in Index.db. But the idea to rebuild the bloom filters from index files isn't going to work with BTI indexes, because they don't store whole partition keys. If we want sstables which don't have Index.db files, we need some other way to deal with oversized filters. Partition keys can be recovered from Data.db, but that would often be way too expensive. This patch adds another way. We introduce a new component file, TemporaryHashes. This component, if written at all, contains the 16-byte murmur hash for every partition key, in order, and can be used in place of Index to reconstruct the bloom filter. (Our bloom filters are actually built from the set of murmur hashes of partition keys. The first step of inserting a partition key into a filter is hashing the key. Remembering the hashes is sufficient to build the filter later, without looking at partition keys again.) As of this patch, if the Index component is not being written, we don't allocate and populate a bloom filter during the Data.db write. 
Instead, we write the murmur hashes to TemporaryHashes, and only later, after the Data write finishes, we allocate the optimal-size bloom filter, we read the hashes back from TemporaryHashes, and we populate the filter with them. That is suboptimal. Writing the hashes to disk (or worse, to S3) and reading them back is more expensive than building the bloom filter during the main Data pass. So ideally it should be avoided in cases where we know in advance that the partition key count estimate is good enough. (Which should be the case in flushes and compactions). But we defer that to a future patch. (Such a change would involve passing some flag to the sstable writer if the cardinality estimate is trustworthy, and not creating TemporaryHashes if the estimate is trustworthy).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	c549afa1a9	utils/bloom_filter: add `add(const hashed_key&)` In one of the next patches, we will want to use (in BTI partition index writer) the same hash as used by the bloom filter, and we'll also want to allow rebuilding the filter in a second pass (after the whole sstable is written) from hashes (as opposed to rebuilding from partition keys saved in Index.db, which is something we sometimes do today) saved to a temporary file. For those, we need an interface that allows us to compute the hash externally, and only pass the hash to `add()`.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	3c83914814	sstables: adapt estimated_keys_for_range to sstables without Summary Before this patch, `estimated_keys_for_range` assumes the presence of the Summary component. But we want to make this component optional in this series. This patch adds a second branch to this function, for sstables which don't have a BIG index (in particular, Summary component), but have a BTI index (Partitions component). In this case, instead of calculating the estimate as "fraction of summary overlapping with given range, multiplied by the total key estimate", we calculate it as "fraction of Data file overlapping with given range, multiplied by the total key estimate". (With an extra conditional for the special case when the given range doesn't overlap with the sstable's range at all. In this case, if the ranges are adjacent, the main path could easily return "1 partition" instead of "0 partitions", due to the inexactness of BTI indexes for range queries. Returning something non-zero in this case would be unfortunate, so the extra conditional makes sure that we return 0).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	55c4b89b88	sstables: make `sstable::estimated_keys_for_range` asynchronous Currently, `sstable::estimated_keys_for_range` works by checking what fraction of Summary is covered by the given range, and multiplying this fraction by the number of all keys. Since computing things on Summary doesn't involve I/O (because Summary is always kept in RAM), this is synchronous. In a later patch, we will modify `sstable::estimated_keys_for_range` so that it can deal with sstables that don't have a Summary (because they use BTI indexes instead of BIG indexes). In that case, the function is going to compute the relevant fraction by using the index instead of Summary. This will require making the function asynchronous. This is what we do in this patch. (The actual change to the logic of `sstable::estimated_keys_for_range` will come in the next patch. In this one, we only make it asynchronous).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	70994170e2	sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary `sstable::get_estimated_key_count()` estimates the partition count from the size of Summary, and the interval between Summary entries. But we want to allow writing sstables without a Summary (i.e. sstables that use BTI indexes instead of BIG indexes), so we want a way to get the key count without involving Summary. For that, we can use the `estimated_partition_size` histogram in Statistics. By counting the histogram entries, we get the exact number of partitions in the sstable.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	68c33c0173	replica/database: add table::estimated_partitions_in_range() Add a function which computes an estimated number of partitions in the given token range. We will use this helper in a later patch to replace a few places in the code which de facto do the same thing "manually".	2025-09-29 13:01:21 +02:00
Michał Chojnowski	5f4b9a03d1	sstables/mx: implement sstable::has_partition_key using a regular read A BTI index isn't able to determine if a given key is present in the sstable, because it doesn't store full keys. (It only stores prefixes of decorated keys, so it might give false positives). If the sstable only has BTI index, and no BIG index, then `sstable::has_partition_key()` will have to be implemented with something other than just the index reader. We might as well ignore the index in all cases and just check that a regular data read for the given partition returns a non-empty result. `sstable::has_partition_key` is only used in the `column_family/sstables/by_key` REST API call that nobody uses anyway, no point in trying to make special optimizations for it.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	893eb4ca1f	sstables: use BTI index for queries, when present and enabled This patch teaches `sstable::make_index_reader` how to create a BTI index reader, from the `Partitions.db` and `Rows.db` components, if they exist (in which case they are opened by this point).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e0fda9ae6f	sstables/mx/writer: populate BTI index files In the previous patch we added code responsible for creating and opening Partitions.db and Rows.db, but we left those files empty. In this patch, we populate the files using `trie::bti_row_index_writer` and `trie::bti_partition_index_writer`. Note: for the row index, we insert the same clustering blocks to both indexes. The logic for choosing the size of the blocks hasn't been changed in any way. Much of this patch has to do with propagating the current range tombstone down to all places which can start a new clustering block. The reason we need that is that, for each clustering block, BIG indexes store the range tombstone succeeding the block (i.e. the range tombstone in between the given block and its successor), while BTI indexes store the range tombstone preceding the block (i.e. the range tombstone in between the given block and its predecessor). So before the patch there's no code which looks at the current tombstone when starting the block, only when ending the block. This patch adds an extra copy for each `decorated_key`. This is mostly unavoidable -- the BTI partition writer just has to remember the key until its successor appears, to find the common prefix. (We could avoid the key copy if the BTI isn't used, though. We don't do that in this patch, we just let the copy happen).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cdcf34b3a0	sstables: create and open BTI index files, when enabled This patch adds code responsible for creation and opening of BTI index components (Rows.db, Partitions.db) when BTI index writing is enabled. (It is enabled if the cluster feature is enabled and the relevant config entry permits it). The files are empty for now, and are never read. We will populate and use them in following patches.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	18875621e8	sstables: introduce Partition and Rows component types BTI indexes are made up of Partitions.db and Rows.db files. In this patch we introduce the corresponding component types. In Cassandra, BTI is a separate "sstable format", with a new set of versions. (I.e. `bti-da`, as opposed to `big-me`). In this patch series, we are doing something different: we are introducing version `ms`, which is like `me`, except with `Index.db` and `Summary.db` replaced with `Partitions.db` and `Rows.db`. With a setup like that, Scylla won't yet be able to read Cassandra's BTI (`da`) files, because this patch doesn't teach Scylla about `da`. (But the way to that is open. It would just require first implementing several other things which changed between `me` and `da`). (And, naturally, Cassandra will reject `ms` sstables. But this isn't the first time we are breaking file compatibility with Cassandra to some degree. Other examples include encryption and dictionary compression). Note: Partitions.db and Rows.db contain prefixes of keys, which is sensitive information, so they have to be encrypted.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e04ee6d5f6	sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time` There's a test (boost/sstable_compaction_test.cc::tombstone_purge_test) which tests the value of `_stats.capped_tombstone_deletion_time`. Before this patch, for "ms" sstables, `to_deletion_time` would have been called twice for each written partition tombstone, which would fail the test. Since `_pi_write_m.partition_tombstone` always ends up being converted from `tombstone` to `sstables::deletion_time` anyway, let's just make it a `sstables::deletion_time` to begin with. This will ensure that `to_deletion_time` is called only once per partition tombstone.	2025-09-29 13:01:20 +02:00
Piotr Dulikowski	3abe6eadce	Merge 'Add CQL documentation for vector queries using SELECT ANN' from Szymon Wasik This PR adds the missing documentation for the SELECT ... ANN statement that allows performing vector queries. This is just the basic explanation of the grammar and how to use it. More comprehensive documentation about vector search will be added separately in Scylla Cloud documentation and features description. Links to this additional documentation will be added as part of VECTOR-244. Fixes: VECTOR-247. No backport is needed as this is the new feature. Closes scylladb/scylladb#26282 * github.com:scylladb/scylladb: cql3: Update error messages to be in line with documentation. docs: Add CQL documentation for vector queries using SELECT ANN	2025-09-29 12:46:55 +02:00
Dario Mirovic	b3347bcf84	test: dtest: limits_test.py: test_max_cells log level Set `lsa-timing` logger log level to `debug`. This will help with the analysis of the whole spectrum of memory reclaim operation times and memory sizes. Refs #25097	2025-09-29 12:39:53 +02:00
Dario Mirovic	554fd5e801	test: dtest: limits_test.py: make the tests work Remove unused imports and markers. Remove Apache license header. Enable the test in suite.yaml for `dev` and `debug` modes. Refs #25097	2025-09-29 12:39:53 +02:00
Dario Mirovic	70128fd5c7	test: dtest: test_limits.py: remove tests that are not being migrated Refs #25097	2025-09-29 12:39:52 +02:00
Dario Mirovic	82e9623911	test: dtest: copy unmodified limits_test.py Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py. Add license header. Disable it for `debug`, `dev`, and `release` mode. Refs #25097	2025-09-29 12:39:52 +02:00
Michał Hudobski	eb8d60f5d4	metrics, test: added a test case for vs metrics This commit adds a test case that checks that vector search metrics are correctly added and have the correct value.	2025-09-29 12:29:21 +02:00
Michał Hudobski	fe4bfffca5	metrics, vector_search: add a dns refresh metric This commit adds a dns refresh counting metric to the vector_store service. We would like to track it to make sure that the networking is working correctly.	2025-09-29 12:28:52 +02:00
Michał Hudobski	74becdd04b	vector_search: move the ann implementation to impl The implementation of the ann function should have been placed in the impl struct, not in the client itself. This commit fixes that.	2025-09-29 12:26:42 +02:00
Piotr Dulikowski	3a05df742e	Merge 'Fix for auth version change during node startup' from Marcin Maliszkiewicz Before this patch we may trigger `SCYLLA_ASSERT(legacy_mode(_qp))`. That's because some auth startup is done in the background and assumes that auth version doesn't change in the middle of the startup. But topology coordinator may decide to do the migration at any time, regardless of whether auth service is fully started on all nodes. This change makes sure that in legacy startup flow we'll always use old auth-v1 keyspace and therefore auth version change in the middle won't negatively affect the flow. Fixes https://github.com/scylladb/scylladb/issues/25505 Closes scylladb/scylladb#25949 * github.com:scylladb/scylladb: auth: mark some auth-v1 functions as legacy auth: use old keyspace during auth-v1 consistently auth: document setting _superuser_created_promise flow in auth-v1	2025-09-29 11:41:27 +02:00
Ernest Zaslavsky	debc756794	treewide: Move transport related files to a `transport` directory As requested in #22112, moved the files and fixed other includes and build system. Moved files: - generic_server.hh - generic_server.cc - protocol_server.hh Fixes: #22112 This is a cleanup, no need to backport Closes scylladb/scylladb#25090	2025-09-29 11:46:06 +03:00
Nadav Har'El	69672a5863	alternator: fix deprecation warning Until recently, Seastar's HTTP server's reply::write_body() only supported a few "well-known" content types. But Alternator uses a lesser known one - "application/x-amz-json-1.0" - so it was forced to use a wrong (but legal) content type, and later override it with the correct one. This was really ugly and we had a comment that once this feature was fixed in Seastar, we should remove the ugly workaround. Well, the time has finally come. We can now finally pass the correct content type to write_body(), and don't need to call the deprecated type-changing function later. The new implementation is less awkward, but actually longer - whereas previously we only set the content type in one place - just before the done(), after this patch we actually need to do it in three places where we write the body (string response, streaming response and error response). But I think this is actually better - there is no inherent reason why, for example, error messages and success messages needed to use the same content type. We use a new constant REPLY_CONSTANT_TYPE so that we don't need to repeat it three times. We already have a regression test for the content-type returned by Alternator, test_manual_requests.py::test_content_type, and this test continues to pass after the patch. But this test only checked the short response path, so we add additional tests for the streaming response path and for the error response path. As usual, the new tests pass on DynamoDB as well. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26268	2025-09-29 11:40:46 +03:00
Pavel Emelyanov	daea284072	Merge 'Make compaction module more self contained' from Botond Dénes There is still some compaction related code left in `sstables/`, move this to `compaction/` to make the compaction module more self-contained. Code cleanup, no backport. Closes scylladb/scylladb#26277 * github.com:scylladb/scylladb: sstables,compaction: move make_sstable_set() implementations to compactions/ sstables,compaction: move compaction exceptions to compaction/	2025-09-29 11:38:30 +03:00
Nadav Har'El	1aef733d48	Merge 'Alternator/cache expressions' from Szymon Malewski Before this patch, every expression in Alternator's requests was parsed from string to adequate structure. This patch enables caching, where input expression strings are mapped to parsed template structures. Every new valid (parsable) expression is added to the cache. The cache has limited (configurable) size - when it is reached, the least recently used entry is removed. When requested expression is in the cache, the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values). Caching is implemented for all expression types. The cache is per shard - shared for all operations, expression types, tables, users. Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled). Basic metrics (total count of hits and misses for each expression type and number of evicted enries) are implemented. Cache features are tested in boost unit tests and overall expression caching is tested with Python tests - both mostly rely on metrics. refs #5023 `perf-alternator` test shows improvement (median): \| test \| throughput \| instructions_per_op \| cpu_cycles_per_op \| allocs_per_op \| \| ------ \| ---------------- \| ----------------------------- \| --------------------------- \| ------------------- \| \| read \| +6.0% \| -8.5% \| -7.0% \| -4.9% \| \| write \| +13.4% \| -17.6% \| -14.7% \| -7.4% \| \| write(lwt) \| +12.7% \| -7.9% \| -6.9% \| -2.8% \| \| write_rwm \| +5.4% \| -10.5% \| -7.3% \| -4.1% \| "read" had a ProjectionExpression with 10 column names, "write" had a UpdateExpression with 10 column names and "write_rmw" had both ConditionExpression and UpdateExpression. 
This patch also includes minor refactoring of other expressions related tests (https://github.com/scylladb/scylladb/issues/22494) - use `test_table_ss` instead of `test_table`. Fixes #25855. This is new feature - no backporting. Closes scylladb/scylladb#25176 * github.com:scylladb/scylladb: alternator: use expression caching alternator: adds expression cache implementation utils: extend lru_string_map utils: add lru_string_map alternator/expressions: error on parsing empty update expression alternator/expressions: fix single value condition expression parsing test/alternator: use `test_table_ss` instead of `test_table` in expressions related tests.	2025-09-29 11:36:31 +03:00
Nadav Har'El	8c99f807d6	test/alternator: regression test for DescribeTable's index schema In issue #5320, we reported a bug where DescribeTable returns the wrong schema for a GSI - it returned as a sort key an attribute which the user didn't actually ask to be a sort key, and was only added because of a requirement of Scylla's materialized-views implementation. We already had a test, test_gsi_2_describe_table_schema, that reproduces that issue. But that test only exercised the specific case that we knew had a bug. There is a risk that the fix to #5320 (which was recently merged) will actually break other cases - different combinations of base and GSI keys, or even LSI keys - and we won't have tests reproducing it. So this patch adds comprehensive regression tests for how DescribeTable shows GSIs and LSIs for all possible combinations of base keys and GSI/LSI keys. As we prove in test comments (and in code) we need to test 15 GSIs and 2 LSIs to test every possible combination. These tests aren't very slow, because we only need to create three base tables to test all these combinations. As usual, the new tests pass on DynamoDB. The new GSI test failed on Alternator before #5320 was fixed, but now passes. The fact all of its cases pass shows that the fix to #5320 didn't cause regressions in other types of GSIs or LSIs. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26047	2025-09-29 08:50:30 +03:00
Piotr Dulikowski	7b91dd9297	Merge 'vector_store_client: Add support for multiple URIs' from Karol Nowacki vector_store_client: Add support for multiple URIs The vector store client now supports a comma-separated list of URIs in the `vector_store_primary_uri` configuration option. It uses the vector store nodes from these URIs for load balancing and high availability, querying the next node if the current one fails. References: VECTOR-187 No backport is needed as this is a new feature. Closes scylladb/scylladb#26212 * github.com:scylladb/scylladb: vector_store_client: Rename host_port struct to uri vector_store_client: Add support for multiple URIs vector_store_client: Remove methods used only in tests	2025-09-29 07:40:45 +02:00
Avi Kivity	5fc3ef56c4	build: switch to Seastar API_LEVEL 8 (noncopyable_function in json) Seastar API level 8 changes a function type from std::function to noncopyable_function. Apply those changes in tree and update the build configuration. Closes scylladb/scylladb#26006	2025-09-29 08:33:49 +03:00
Pavel Emelyanov	c029afc6d8	Merge 'test.py: dtest: port cfid_test.py' from Evgeniy Naydanov As a part of the porting process remove unused markers. Explicitly enable auto snapshots for the test, as they are required for it. Enable the test in suite.yaml (run in dev mode only) Closes scylladb/scylladb#26248 * github.com:scylladb/scylladb: test.py: dtest: make cfid_test.py run using test.py As a part of the porting process remove unused markers. test.py: dtest: copy unmodified cfid_test.py	2025-09-29 08:26:21 +03:00
Botond Dénes	6ba1d686e6	sstables,compaction: move make_sstable_set() implementations to compactions/ Various compaction strategies still have their respective make_sstable_set() implementation in sstables/sstable_set.cc. Move them to the appropriate .cc files in compaction/, making the compaction module more self contained.	2025-09-29 06:49:14 +03:00
Botond Dénes	9c85046f93	sstables,compaction: move compaction exceptions to compaction/ sstables/exceptions.hh still hosts some compaction specific exception types. Move them over to the new compaction/exceptions.hh, to make the compaction module more self-contained.	2025-09-29 06:49:14 +03:00
Botond Dénes	2b4a140610	replica: move querier code to replica namespace The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name, but this is a mistake which confuses people. Now that the code was moved to replica/, also fix the namespace to be namespace replica.	2025-09-29 06:44:52 +03:00
Botond Dénes	ee3d2f5b43	root,replica: mv querier to replica/ The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. But this is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module to make this more clear.	2025-09-29 06:33:53 +03:00
Michał Chojnowski	ff5add4287	sstables/trie: BTI-translate the entire partition key at once Delaying the BTI encoding of partition keys is a good idea, because most of the time they don't have to be encoded. Usually the token alone is enough for indexing purposes. But for the translation of the `partition_key` part itself, there's no good reason to make it lazy, especially after we made the translation of clustering keys eager in a previous commit. Let's get rid of the `std::generator` and convert all cells of the partition key in one go.	2025-09-29 04:10:40 +02:00
Michał Chojnowski	4e35220734	sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset() Using `std::generator` could incur some unnecessary allocation or confuse the optimizer. Let's replace it with something simpler.	2025-09-29 04:10:40 +02:00
Michał Chojnowski	88c9af3c80	sstables/trie: perform the BTI-encoding of position_in_partition eagerly Applying lazy evaluation to the BTI encoding of clustering keys was probably a bad default. The benefits are dubious (because it's quite likely that the laziness won't allow us to avoid that much work), but the overhead needed to implement the laziness is large and immediate. In this patch we get rid of the laziness. We rewrite lazy_comparable_bytes_from_clustering_position so that it performs the translation eagerly, all components to a single bytes_ostream. Note: the name lazy_comparable_bytes_from_clustering_position stays, because the interface is still lazy. perf_bti_key_translation: Before: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6 After: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9	2025-09-29 04:10:40 +02:00
Michał Chojnowski	7d57643361	types/comparable_bytes: add comparable_bytes_from_compound Add a function which converts compound types (keys and key prefixes) to BTI encoding. It's almost the same as the existing `lazy_comparable_bytes_from_compound` (in bti_key_translation.cc), except it eagerly serializes key components to a bytes_ostream instead of lazily yielding them from a generator. We will remove `lazy_comparable_bytes_from_compound` in a later commit.	2025-09-29 04:10:38 +02:00
Michał Chojnowski	3703197c4c	test/perf: add perf_bti_key_translation Add a microbenchmark for translating keys to BTI encoding.	2025-09-29 04:08:00 +02:00
Michael Litvak	6bc41926e2	view_builder: reduce log level for expected aborts during view creation When draining the view builder, we abort ongoing operations using the view builder's abort source, which may cause them to fail with abort_requested_exception or raft::request_aborted exceptions. Since these failures are expected during shutdown, reduce the log level in add_new_view from 'error' to 'debug' for these specific exceptions while keeping 'error' level for unexpected failures. Closes scylladb/scylladb#26297	2025-09-28 22:55:07 +03:00
Avi Kivity	5b40d4d52b	Merge 'root,replica: mv multishard_mutation_query -> replica/multishard_query' from Botond Dénes The code in `multishard_mutation_query.cc` implements the replica-side of range scans and as such it belongs in the replica module. Take the opportunity to also rename it to `multishard_query`, the code implements both data and mutation queries for a long time now. Code cleanup, no backport required. Closes scylladb/scylladb#26279 * github.com:scylladb/scylladb: test/boost: rename multishard_mutation_query_test to multishard_query_test replica/multishard_query: move code into namespace replica replica/multishard_query.cc: update logger name docs/paged-queries.md: update references to readers root,replica: move multishard_mutation_query to replica/	2025-09-28 20:24:46 +03:00
Avi Kivity	5b6570be52	Merge 'db/config: Add SSTable compression options for user tables' from Nikos Dragazis ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well. This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality. Fixes #25195. Closes scylladb/scylladb#26003 * github.com:scylladb/scylladb: test/cluster: Add tests for invalid SSTable compression options test/boost: Add tests for SSTable compression config options main: Validate SSTable compression options from config db/config: Add SSTable compression options for user tables db/config: Prepare compression_parameters for config system compressor: Validate presence of sstable_compression in parameters compressor: Add missing space in exception message	2025-09-28 20:23:23 +03:00
Artsiom Mishuta	eedd61f43f	test.py: remove 'sudo' from resource_gather.py The container now runs as root (`4c1f4c419c`), so sudo is not needed anymore Closes scylladb/scylladb#26294	2025-09-28 16:51:19 +03:00
Avi Kivity	1c1e8802d5	Merge 'Fix lifetime problems between group0 and sstable dictionary trainings' from Michał Chojnowski Apparently the group0 server object dies (and is freed) during drain/shutdown, and I didn't take that into account in my https://github.com/scylladb/scylladb/pull/23025, which still attempts to use it afterwards. The patch fixes two problems. The problem with `is_raft_leader` has been observed in tests. The problem with `publish_new_sstable_dict` has not been observed, but AFAIU (based on code inspection) it exists. I didn't attempt to prove its existence with a test. Should be backported to 2025.3. Closes scylladb/scylladb#25115 * github.com:scylladb/scylladb: storage_service: in publish_new_sstable_dict, use _group0_as instead of the main abort source storage_service: hold group0 gate in `publish_new_sstable_dict`	2025-09-28 14:27:37 +03:00
Szymon Malewski	6ce7843774	alternator: use expression caching Before this patch, every expression in Alternator's requests was parsed from string to adequate structure. This patch enables caching - all calls to parse an expression (all types) are proxied through the cache. New expression is added to the cache, the least recently used entry (above cache size) is removed. For existing entries the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values). The cache is per shard - shared for all operations, expression types, tables, users. Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled). Added Python tests are based on metrics.	2025-09-28 04:27:44 +02:00
Szymon Malewski	b75c6c9ef7	alternator: adds expression cache implementation Every expression in Alternator's requests is parsed from string to adequate structure. This patch implements a caching structure (input expression strings mapping to parsed 'template' structures), which will be used for handling requests in following commits. If the requested expression is valid (parsable) the cache will always return a value - if it is not already in the cache it will be created and stored. The cache has limited (live configurable) size - when it is reached, the least recently used entry is removed. The copy of the template in cache is returned - individual instances still need to be resolved (placeholders substituted with names and values). Invalid requests will have no effect on the cache - the parser throws an exception. Caching is implemented for all expression types. Internally it is based on helper structure `lru_string_map`. Basic metrics (total count of hits and misses for each expression type and number of evictions) are implemented. Metrics are used in boost unit tests.	2025-09-28 04:27:44 +02:00
Szymon Malewski	bb8004e52d	utils: extend lru_string_map This patch extends `lru_string_map` with `sized_string_map` - a class that helps to control cache size. It implements cache resizing in a background thread.	2025-09-28 04:27:33 +02:00
Szymon Malewski	5332ceb24e	utils: add lru_string_map Adds a lru_string_map definition. This structure maps string keys to templated arguments, allowing efficient lookup and adding keys. Each lookup (and adding) puts the keys on an internal LRU list and the entries may be efficiently removed in LRU order. It will be a base for the expression cache in Alternator.	2025-09-28 04:06:00 +02:00
Szymon Malewski	65c95d3a93	alternator/expressions: error on parsing empty update expression With this patch empty update expression is no longer accepted by the parser. So far it was rejected only after resolving, however it could pollute the expression cache.	2025-09-28 04:06:00 +02:00
Szymon Malewski	be159acc03	alternator/expressions: fix single value condition expression parsing Primitive conditions usually use operator with two or more values. The only case of a "single value" condition is a function call - DynamoDB does not accept other general values (i.e., attribute or value references). In Alternator single general value was parsed as correct and only failed later when the calculated value turned out not to be a boolean. This works, but not when attribute or value actually is boolean. What is more, when a parsed (but not resolved) expression is cached, this invalid expression could pollute the cache. This would be also the only case where the same string can be parsed both as a condition and a projection expression. The issue is fixed by explicitly checking this case at primitive condition parsing. Updated test confirms consistency between Alternator and DynamoDB. Fixes #25855.	2025-09-28 04:06:00 +02:00
Szymon Malewski	7ed38155a3	test/alternator: use `test_table_ss` instead of `test_table` in expressions related tests. This patch includes minor refactoring of expressions related tests (#22494) - use `test_table_ss` instead of `test_table`.	2025-09-28 04:06:00 +02:00
Botond Dénes	34cc7aafae	tools/scylla-sstable: introduce the upgrade command An offline, scylla-sstable variant of nodetool upgradesstables command. Applies latest (or selected) sstable version and latest schema. Closes scylladb/scylladb#26109	2025-09-27 16:53:14 +03:00
Avi Kivity	24b5d08731	Merge 'Remove table::for_all_partitions_slow()' from Pavel Emelyanov This method was once implemented by calling table::for_all_partitions(), which was supposed to be non-slow version. Then callers of "non-slow" method were updated and the method itself was renamed into "_slow()" one. Nowadays only one test still uses it. At the same time the method itself mostly consists of a boilerplate code that moves bits around to call lambda on the partitions read from reader. Open-coding the method into the calling test results in much shorter and simpler to follow code. Code cleanup, no backport needed Closes scylladb/scylladb#26283 * github.com:scylladb/scylladb: test: Fix indentation after previous patch test: Opencode for_all_partitions_slow() test: Coroutinize test_multiple_memtables_multiple_partitions inner lambda table: Move for_all_partitions_slow() to test	2025-09-27 16:26:18 +03:00
Karol Nowacki	f8b1addfaf	vector_store_client: Rename host_port struct to uri The `host_port` struct represents the parsed components of the vector store URI. Renaming it to `uri` more accurately reflects its purpose.	2025-09-27 09:04:46 +02:00
Karol Nowacki	27f6459766	vector_store_client: Add support for multiple URIs The vector store client now supports a comma-separated list of URIs in the `vector_store_primary_uri` configuration option. It uses the vector store nodes from these URIs for load balancing and high availability, querying the next node if the current one fails.	2025-09-27 09:04:45 +02:00
Karol Nowacki	a310cb4c64	vector_store_client: Remove methods used only in tests The `vector_store_client::port()` and `vector_store_client::host()` methods were only used in the test code. Moreover, these tests are no longer needed, as the proper parsing of the URI is already tested in other tests that perform requests to the vector store server mock.	2025-09-27 08:47:00 +02:00
Piotr Dulikowski	39145ff1d0	Merge 'vector_store_client: Add support for load balancing' from Karol Nowacki This change introduces a load balancing mechanism for the vector store client. The client can now distribute requests across multiple vector store nodes. The distribution mechanism performs random selection of nodes for each request. References: VECTOR-187 No backport is needed as this is a new feature. Closes scylladb/scylladb#26205 * github.com:scylladb/scylladb: vector_store_client: Add support for load balancing vector_store_client_test: Introduce vs_mock_server vector_store_client_test: Relocate to a dedicated directory	2025-09-26 18:55:14 +02:00
Petr Gusev	c1cc52c8c8	lwt: prohibit for tablet-based views and cdc logs SELECT commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees. We prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based view for compatibility. Similar logic is applied to CDC log tables. Fixes scylladb/scylladb#26258	2025-09-26 17:06:58 +02:00
Szymon Wasik	ccfe80ab97	cql3: Update error messages to be in line with documentation. ANN (approximate nearest neighbor) is just the name of the type of algorithm used to perform vector search. For this reason the error messages should refer to vector queries rather than ANN queries.	2025-09-26 17:01:10 +02:00
Petr Gusev	8adbb6c4dd	tablets: disallow chains of colocated tables	2025-09-26 16:52:43 +02:00
Petr Gusev	b01f56a6d3	database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda In upcoming commits we’ll add a test to ensure that a table cannot be colocated with another table that is itself already colocated. This must also hold in the case where both colocated tables are created simultaneously in a single migration_manager announcement. We use Paxos tables as an example of colocated tables in this test. To support this, get_base_table_for_tablet_colocation needs to look for the base table among the batch of tables being created.	2025-09-26 16:46:32 +02:00
Pavel Emelyanov	04a40b08f7	test: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:39:09 +03:00
Pavel Emelyanov	813619f939	test: Opencode for_all_partitions_slow() The method is a large boilerplate that moves stuff around to do a simple thing -- read mutations from reader in a row and "check" them with a lambda, optionally breaking the loop if lambda wants it. The whole thing is much shorter if the caller kicks the reader on its own. One thing to note -- reader is not closed if something throws in between, but that's test anyway, if something throws, test fails and not closed reader is not a big deal. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:36:58 +03:00
Pavel Emelyanov	c1ebf987a9	test: Coroutinize test_multiple_memtables_multiple_partitions inner lambda The only place where it needs futures is to call the for_all_partitions_slow() from a table Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:35:59 +03:00
Pavel Emelyanov	f3c57f7dd0	table: Move for_all_partitions_slow() to test It's now only used by a single test, so move it there and remove from public table API. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:33:25 +03:00
Szymon Wasik	0194a53659	docs: Add CQL documentation for vector queries using SELECT ANN This patch adds the missing documentation for the SELECT ... ANN statement that allows performing vector queries. This is just the basic explanation of the grammar and how to use it. More comprehensive documentation about vector search will be added separately in Scylla Cloud documentation and features description. Links to this additional documentation will be added as part of VECTOR-244. Fixes: VECTOR-247.	2025-09-26 15:07:00 +02:00
Marcin Maliszkiewicz	5f0041d068	auth: mark some auth-v1 functions as legacy	2025-09-26 14:40:53 +02:00
Marcin Maliszkiewicz	793a64a50e	auth: use old keyspace during auth-v1 consistently Before this patch we may trigger an assertion on legacy_mode(_qp). That's because some auth startup is done in the background and assumes that auth version doesn't change in the middle of the startup. But topology coordinator may decide to do the migration at any time, regardless if auth service is fully started on all nodes. This change makes sure that in legacy startup flow we'll always use old auth-v1 keyspace and therefore auth version change in the middle won't negatively affect the flow.	2025-09-26 14:40:52 +02:00
Karol Nowacki	a0e62ef8de	vector_store_client: Add support for load balancing This change introduces a load balancing mechanism for the vector store client. The client can now distribute requests across multiple vector store nodes. The distribution mechanism performs random selection of nodes for each request.	2025-09-26 13:44:28 +02:00
Karol Nowacki	ee90530c31	vector_store_client_test: Introduce vs_mock_server Introduce the `vs_mock_server` test class, which is capable of counting incoming requests. This will be used in subsequent tests to verify load balancing logic.	2025-09-26 12:27:06 +02:00
Nikos Dragazis	8410532fa0	test/cluster: Add tests for invalid SSTable compression options Complementary to the previous patch. It triggers semantic validation checks in `compression_parameters::validate()` and expects the server to exit. The tests examine both command line and YAML options. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Nikos Dragazis	6ba0fa20ee	test/boost: Add tests for SSTable compression config options Since patch `03461d6a54`, all boost unit tests depending on `cql_test_env` are compiled into a single executable (`combined_tests`). Add the new test in there. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Nikos Dragazis	8d5bd212ca	main: Validate SSTable compression options from config `compression_parameters` provides two levels of validation: * syntactic checks - implemented in the constructor * semantic checks - implemented by `compression_parameters::validate()` The former are applied implicitly when parsing the options from the command line or from scylla.yaml. The latter are currently not applied, but they should be. For lack of a better place, apply them in main, right after joining the cluster, to make sure that the cluster features have been negotiated. The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation will fail if the feature is disabled and a dictionary compression algorithm has been selected. Also, mark `validate()` as const so that it can be called from a config object. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Nikos Dragazis	e1d9c83406	db/config: Add SSTable compression options for user tables ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size (refer to the default constructor for `compression_parameters`). The same default applies to system tables as well. Add a new configuration option to allow customizing the default for user tables. Use the previously hardcoded default as the new option's default value. Note that the option has no effect on ALTER TABLE statements. An altered table either inherits explicit compression options from the CQL statement, or maintains its existing options. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:00 +03:00
Botond Dénes	52c05d89aa	test/boost: rename multishard_mutation_query_test to multishard_query_test	2025-09-26 11:15:38 +03:00
Botond Dénes	3be4f0698f	replica/multishard_query: move code into namespace replica Complete the migration, add code to the replica namespace too.	2025-09-26 11:15:38 +03:00
Botond Dénes	ed50a307db	replica/multishard_query.cc: update logger name To reflect the new file name.	2025-09-26 11:15:38 +03:00
Botond Dénes	8f90137e87	docs/paged-queries.md: update references to readers Both links to reader code are outdated, update them.	2025-09-26 11:15:38 +03:00
Botond Dénes	fb16c0a6d4	root,replica: move multishard_mutation_query to replica/ It belongs there, it is a completely replica-side thing. Also take the opportunity to rename it to multishard_query.{hh,cc}, it is not just mutation anymore (data query is also implemented).	2025-09-26 11:15:38 +03:00
Piotr Dulikowski	68d5dcfa23	Merge 'Coroutinize gossipring property file snitch' from Pavel Emelyanov Most of its then-chains are quite hairy and look much nicer as coroutines. Last patch restores indentation. Code cleanup, no backport required. Closes scylladb/scylladb#26271 * github.com:scylladb/scylladb: snitch: Reindent after previous changes snitch: Make periodic_reader_callback() a coroutine snitch: Coroutinize pause_io() snitch: Coroutinize stop() snitch: Coroutinize reload_configuration() snitch: Coroutinize read_property_file() snitch: Coroutinize start() snitch: Coroutinize property_file_was_modified()	2025-09-26 08:32:19 +02:00
Avi Kivity	0f4363cc8d	Merge 'sstable: add more complete schema to scylla component' from Botond Dénes Sstables store a basic schema in the statistics component. The scylla-sstable tool uses this to be able to read and dump sstables in a self-contained manner, without requiring an external schema source. The problem is that the schema stored in the statistics component is incomplete: it doesn't store column names for key columns, so these have placeholder names in dump outputs where column names are visible. This is not a disaster but it is confusing and it can cause errors in scripts which want to check the content of sstables, while also knowing the schema and expecting the proper names for key columns. To make sstables truly self-contained w.r.t. the schema, add a complete schema to the scylla component. This schema contains the names and types of all columns, as well as some basic information about the schema: keyspace name, table name, id and version. When available, scylla-sstable's schema loader will use this new more complete schema and fall-back to the old method of loading the (incomplete) schema from the statistics component otherwise. New feature, no backport required. Closes scylladb/scylladb#24187 * github.com:scylladb/scylladb: test/boost/schema_loader_test: add specific test with interesting types test/lib/random_schema: add random_schema(schema_ptr) constructor test/boost/schema_loader_test: test_load_schema_from_sstable: add fall-back test tools/schema_loader: add support for loading from scylla-metadata tools/schema_loader: extract code which load schema from statistics sstables: scylla_metadata: add schema member	2025-09-26 00:21:17 +03:00
Artsiom Mishuta	f23d19e248	test.py: fix dumping big logs to output 1. Remove dumping cluster logs and print only the link to the log. 2. Fail the test (to fail CI and not ignore the problem) and mark the cluster as dirty (to avoid affecting subsequent tests) in case setup/teardown fails. 3. Add 2 cqlpy tests that fail after applying step 2 to the dirties_cluster list so the cluster is discarded afterward. Closes scylladb/scylladb#26183	2025-09-25 22:36:46 +03:00
Pavel Emelyanov	78d32c52f2	test: Use map_reduce0 in sstable_directory_test.cc (and coroutinize) There's a code that tries to accumulate some counter across a sharded service by hand. Using map_reduce0() looks nicer and avoids the smp-safe atomic counter. Also -- coroutinize the thing while at it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26259	2025-09-25 21:22:12 +03:00
Pavel Emelyanov	56547992b9	snitch: Reindent after previous changes Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	234865d13c	snitch: Make periodic_reader_callback() a coroutine It was a void method called from timer that spawned a fiber into a background. Now make it a coroutine, but spawn it to background by caller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	3bd4673b03	snitch: Coroutinize pause_io() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	f7608261b4	snitch: Coroutinize stop() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	46860b461c	snitch: Coroutinize reload_configuration() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	90751cd499	snitch: Coroutinize read_property_file() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	04d2502e8f	snitch: Coroutinize start() And merge two if branches that call set_snitch_ready() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	48f7614ea9	snitch: Coroutinize property_file_was_modified() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:47 +03:00
Michael Litvak	ad1a5b7e42	service/qos: set long timeout for auth queries on SL cache update pass an appropriate query state for auth queries called from service level cache reload. we use the function qos_query_state to select a query_state based on caller context - for internal queries, we set a very long timeout. the service level cache reload is called from group0 reload. we want it to have a long timeout instead of the default 5 seconds for auth queries, because we don't have strict latency requirement on the one hand, and on the other hand a timeout exception is undesired in the group0 reload logic and can break group0 on the node. Fixes scylladb/scylladb#25290	2025-09-25 16:55:29 +02:00
Michael Litvak	3c3dd4cf9d	auth: add query_state parameter to query functions add a query_state parameter to several auth functions that execute internal queries. currently the queries use the internal_distributed_query_state() query state, and we maintain this as default, but we want also to be able to pass a query state from the caller. in particular, the auth queries currently use a timeout of 5 seconds, and we will want to set a different timeout when executed in some different context.	2025-09-25 16:46:50 +02:00
Michael Litvak	a1161c156f	auth: refactor query_all_directly_granted rewrite query_all_directly_granted to use execute_internal instead of query_internal in a style that is more consistent with the rest of the module. This will also be useful for a later change because execute_internal accepts an additional parameter of query_state.	2025-09-25 16:37:04 +02:00
Szymon Wasik	7c4ef9aae7	docs: Add documentation for creating vector search indexes This patch adds CQL documentation about creating vector search indexes. It includes the syntax and description of parameters. It does not cover VECTOR type that is already supported and documented and it does not cover querying vectors which will be covered by a separate PR. Fixes: VECTOR-217 Closes scylladb/scylladb#26233	2025-09-25 14:49:50 +02:00
Pavel Emelyanov	6670090581	Merge 'compaction: move code to namespace compaction' from Botond Dénes The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented. Code cleanup, no backport. Closes scylladb/scylladb#26214 * github.com:scylladb/scylladb: compaction: remove using namespace {compaction,sstables} compaction: move code to namespace compaction	2025-09-25 15:10:11 +03:00
Karol Nowacki	c4e13959ab	vector_store_client_test: Relocate to a dedicated directory The `vector_store_client_test` is moved from `test/boost` to a new `test/vector_search` directory. This change prepares a dedicated location for all upcoming tests related to the vector search feature.	2025-09-25 14:04:28 +02:00
Botond Dénes	1999d8e3d3	compaction: remove using namespace {compaction,sstables} Some files in compaction/ have using namespace {compaction,sstables} clauses, some even in headers. This is considered bad practice and muddies the namespace use. Remove them.	2025-09-25 15:03:57 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Pavel Emelyanov	b30c8a1f25	test: Add validation of data returned by /storage_service endpoints The test compares the ranges that are returned from /describe_ring and /range_to_endpoint_map with the information obtained from system.tablets and makes sure that - the number of ranges - the boundary tokens - the target replicas (nodes only) match. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 14:53:22 +03:00
Nadav Har'El	265660d22f	test/cluster: greatly speed up test_localnodes_joining_nodes The test cluster/test_alternator::test_localnodes_joining_nodes takes a whopping 2 minutes and 9 seconds to run before this patch. After this patch, it takes just 7 seconds. The slowness of this test was caused by booting a second node that hangs during boot for 2 minutes, deliberately. We never intended for this boot to finish (the whole point of this test is to run before it finishes), but unfortunately had to wait for it to avoid all sort of nasty problems with unwaited futures. As comments already explained in the code, the solution to this problem is to kill the server at the end of the test - after we kill it, we can wait for it - this wait will very quickly notice that the server addition failed, and not need to wait 2 minutes. But until the previous patch, we had no API to find the server which is starting (not yet running), or to kill it. After the previous patch, we do have such an API, and can now use it, and see this test finish in 7 seconds instead of 2 minutes and 9 seconds.	2025-09-25 14:00:16 +03:00
Nadav Har'El	aa8d6e9e74	test/pylib: add the ability to stop currently-starting servers Some tests need the ability to abruptly stop a server in the test cluster before it has fully booted - e.g., because the test knows (and perhaps even expects) that the boot is hung. But before this patch, manager.server_stop() could only kill servers in "running" state. This patch adds to pylib tracking of "starting" servers - servers which we are starting but haven't finished booting - their list can be returned by the manager.starting_servers(). The manager.server_stop function can now kill a server which is just starting - not just "running" servers. To avoid breaking existing tests, manager.all_servers() continues to return just running and stopped servers - not "starting" servers. By the way, when a starting server is killed, it is not listed as stopped - it just behaves as a normal failure to add the server, and not as a server which successfully joined the cluster but was later stopped. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-25 14:00:16 +03:00
Lakshmi Narayanan Sreethar	6d3a8e89e7	cmake: introduce `Scylla_WITH_DEBUG_INFO` option The `configure.py` script has an `--debuginfo` option that allows overriding compiler debug information generation, regardless of the build mode. Add a similar option to CMake, and ensure it is set when CMake is invoked from `configure.py` with `--debuginfo` enabled. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26243	2025-09-25 12:49:36 +03:00
Łukasz Paszkowski	62e27e0f77	test_out_of_space_prevention.py: Fix flaky test_node_restart_while_tablet_split test The test starts a 3-node cluster and immediately creates a big file on the first node in order to trigger the out of space prevention to disable compaction, including the SPLIT compaction. In order to trigger a SPLIT compaction, a keyspace with 1 initial tablet is created followed by alter statement with `tablets = {'min_tablet_count': 2}`. This triggers a resize decision that should not finalize due to disabled compaction on the first node. The test is flaky because the keyspace is created with RF=1 and there is no guarantee that the tablet replica will be located on the first node with critical disk utilization. If that is not the case, the split is finalized and the test fails, because it expects the split to be blocked. Change to RF=3. This ensures there is exactly one tablet replica on each node, including the one with critical disk utilization. So SPLIT is blocked until the disk utilization on the first node drops below the critical level. Fixes: https://github.com/scylladb/scylladb/issues/25861 Closes scylladb/scylladb#26225	2025-09-25 11:54:48 +03:00
Nadav Har'El	f65998cd39	alternator: improve error handling of incorrect ARN Before this patch, if an ARN that is passed to Alternator requests like TagResource is well-formatted but points to a non-existent table, Alternator returns the unhelpful error: (AccessDeniedException) when calling the TagResource operation: Incorrect resource identifier This patch modifies this error to be: (ResourceNotFoundException) when calling the TagResource operation: ResourceArn 'arn:scylla:alternator:alternator_alternator_Test_ 1758532308880:scylla:table/alternator_Test_1758532308880x' not found This is the same error type (ResourceNotFoundException) that DynamoDB returns in that case - and a more helpful error message. This patch also includes a regression test that checks the error type in this case. The new test fails on Alternator before this patch, and passes afterwards (and also passes on DynamoDB). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26179	2025-09-25 11:51:17 +03:00
Botond Dénes	80bc1af05a	test/boost/schema_loader_test: add specific test with interesting types Although the existing random schema test should cover all types, it is good to have targeted tests for more interesting types.	2025-09-25 11:28:35 +03:00
Botond Dénes	f10af4b5eb	test/lib/random_schema: add random_schema(schema_ptr) constructor Allow using the convenient random data generation facilities, for any schema.	2025-09-25 11:28:34 +03:00
Botond Dénes	4c9da11bfb	test/boost/schema_loader_test: test_load_schema_from_sstable: add fall-back test The test now tests loading the schema from the scylla component by default. Force testing the fall-back (read schema from statistics) by deleting the Scylla.db component. Also improve the test by comparing the column names and types, to check that when loaded from the scylla component, the key names are also correct.	2025-09-25 11:28:34 +03:00
Botond Dénes	b85d858f6d	tools/schema_loader: add support for loading from scylla-metadata When available, load the schema from the Scylla component, where the column names of keys are also available. Fall-back to loading the schema from the Statistics component otherwise (previous behaviour).	2025-09-25 11:28:34 +03:00
Botond Dénes	ace2ba06c3	tools/schema_loader: extract code which load schema from statistics Soon there will be an alternative method too: load from scylla-metadata.	2025-09-25 11:28:34 +03:00
Botond Dénes	234f905fa4	sstables: scylla_metadata: add schema member To store the most important schema fields, like id, version, keyspace name, table name and the list of all columns, along with their kind, name and type. This will serve as alternative schema source to the one stored in statistics component. This latter one stores neither the metadata nor the primary key names (just the types), so it leads to confusion when it is used as schema source for scylla-sstable. This new schema stored in the scylla-metadata component is not intended to be a full-schema, equivalent to the one stored in the schema tables, it is intended to be good enough for scylla-sstable being able to parse sstables in a self-sufficient manner.	2025-09-25 11:28:34 +03:00
Botond Dénes	9908eb3b75	Merge 'Coroutinize filesystem_storage::check_create_links_replay()' from Pavel Emelyanov Shorter and cleaner this way. The method is doing parallel_for_each(some_lambda) and the PR only touches the lambda, the outer invocation is probably not worth it to convert plain return into a co_await Coroutinization enhancement, no need to backport Closes scylladb/scylladb#26188 * github.com:scylladb/scylladb: sstables: Restore indentation after previous patch sstables: Coroutinize filesystem_storage::check_create_links_replay()	2025-09-25 11:05:52 +03:00
Asias He	b31e651657	repair: Always reset node ops progress to 100% upon completion Always set the node ops progress to 100% when the operation finishes, regardless of success or failure. This ensures the progress never remains below 100%, which would otherwise indicate a pending node operation in case of an error. Fixes #26193 Closes scylladb/scylladb#26194	2025-09-25 11:05:52 +03:00
Botond Dénes	50038ef2cc	Merge 'alternator: update references to alternator streams issue' from Michael Litvak update all the references about the issue of tablets support for alternator streams to issue https://github.com/scylladb/scylladb/issues/23838 instead of https://github.com/scylladb/scylladb/issues/16317. The issue https://github.com/scylladb/scylladb/issues/16317 is about support of CDC with tablets, but it is now closed and it didn't address alternator streams. the remaining issues about alternator streams should be addressed as part of https://github.com/scylladb/scylladb/issues/23838, so fix the references in order for them not to be missed. backport is not needed Closes scylladb/scylladb#26178 * github.com:scylladb/scylladb: test/cqlpy/test_permissions: unskip test for tablets alternator: update references to alternator streams issue	2025-09-25 11:05:52 +03:00
Botond Dénes	f7fd12c2f5	Merge 'test: fix test_one_big_mutation_corrupted_on_startup' from Cezar Moise The commitlog in the tests with big mutations was corrupted by overwriting 10 chunks of 1KB with random data, which might not be enough due to randomness and the big size of the commitlog (~65MB). - change `corrupt_file` to overwrite based on a percentage of the file's size instead of a fixed number of chunks - fix typos - cleanup comments for clarity Closes: #25627 Closes scylladb/scylladb#25979 * github.com:scylladb/scylladb: test: cleanup big mutation commitlog tests test: fix test_one_big_mutation_corrupted_on_startup	2025-09-25 11:05:52 +03:00
Botond Dénes	8d0913cdfe	Merge 'Fix: small grammatical changes' from Sayanta Banerjee Fixed some minor grammatical changes in the documentation. Closes scylladb/scylladb#24675 * github.com:scylladb/scylladb: Update docs/features/cdc/cdc-streams.rst Small grammatical changes	2025-09-25 11:05:51 +03:00
Dani Tweig	a10cac9c0a	Update urgent_issue_reminder.yml - run daily The action will run daily, alerting about urgent issues not touched in the last 7 days. Closes scylladb/scylladb#25598	2025-09-25 11:05:51 +03:00
Asias He	4f3d076dab	tablets: Demote set sstables_repaired_at log to debug level This log is too excessive when the tablet count is high. Demote to debug level. Fixes #25926 Closes scylladb/scylladb#26175	2025-09-25 11:05:51 +03:00
Ferenc Szili	d462dc8839	docs: add description of number of tablets computed by tablet allocator This change adds the documentation section which explains the algorithm to compute the absolute number of tablets a table has. Fixes: #25740 Closes scylladb/scylladb#25741	2025-09-25 11:05:51 +03:00
Pavel Emelyanov	8438c59ad3	scylla-gdb: Fix fair-queue entry printing Catching a live entry in IO queue is very rare event, so we haven't seen it so far, but the `_ticket` member had been removed ~2 years ago and had been replaced with `_capacity` which is plain 64bit integer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26185	2025-09-25 11:05:51 +03:00
Botond Dénes	118561dd4f	Merge 'configure.py: fix passing of user linker flags to cmake' from Lakshmi Narayanan Sreethar Any user provided linker flags are converted to a semicolon separated string by the `configure.py` script and then passed to cmake via the `CMAKE_EXE_LINKER_FLAGS` option. But `CMAKE_EXE_LINKER_FLAGS` expects semicolon separated values only when set from within CMakeLists. When the option is set from the shell, which is the case with `configure.py` executing cmake, the values should be separated by space. So, pass the flags as it is instead of separating them with semicolons. `configure.py` also incorrectly concatenates the user linker flags and internal linker flags without a space in between causing flags like '-gz' and '-fuse-ld=lld' to merge into a single invalid argument. Fix that by using a proper space when concatenating these two flags. Fixes #26232 Fix to a dev build option. Backport not required. Closes scylladb/scylladb#26234 * github.com:scylladb/scylladb: configure.py: fix concatenation of user linker flags configure.py: fix passing of user linker flags to cmake	2025-09-25 11:05:51 +03:00
Lakshmi Narayanan Sreethar	690546fa40	cmake: link vector_search library to cql3 library The `indexed_table_select_statement::actually_do_execute()` method references `vector_search::vector_store_client::ann()`, but the `vector_search` library, which provides its definition, is not linked with the `cql3` library. This causes linker errors when other targets are built, for example linking `comparable_bytes_test`, which links the `types` library that in turn links `cql3` throws the following error : ``` ...error: undefined symbol: vector_search::vector_store_client::ann... ``` Fix by adding `vector_search` to the private link libraries of `cql3`. Fixes #26235 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26237	2025-09-25 11:05:51 +03:00
Pavel Emelyanov	d670e01bab	api: Handle stop compaction endpoint without database help The handler in question calls replica::database's invoke_on_all and then gets compaction manager from local db and finds the table object from it as well. The latter is needed to provide filter function for compaction_manager::stop_compaction() call and stop only compactions for specific table. Using replica::database can be avoided here (that's the part of dropping http_context -> database dependency eventually): - using sharded<compaction_manager> instead, it's c.m. that's needed on all shards, not really the database - don't search for table object on db, instead get table ID from parsed table_info instead to provide the correct filter function (continuation of #25846) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26082	2025-09-25 11:05:50 +03:00
Pavel Emelyanov	8f815de1e0	Merge 'treewide: move away from accessing httpd::request::query_parameters' from Botond Dénes Accessing this member directly is deprecated, migrate code to use {get,set}_query_param() and friends instead. Fixes: https://github.com/scylladb/scylladb/issues/26023 Preparation for seastar update, no backport required. Closes scylladb/scylladb#26024 * github.com:scylladb/scylladb: treewide: move away from accessing httpd::request::query_parameters test/pylib/s3_server_mock.py: better handle empty query params	2025-09-25 11:05:50 +03:00
Łukasz Paszkowski	5f6df4eb97	test/storage: Properly mount/clear volumes Due to a missing functionality in PythonTest, `unshare` is never used to mount volumes. As a consequence: + volumes are created with sudo which is undesired + they are not cleared automatically Even having the missing support in place, the approach with mounting volumes with `unshare` would not work as http server, a pool of clusters, and scylla cluster manager are started outside of the new namespace. Thus cluster would have no access to volumes created with `unshare`. The new approach that works with and without dbuild and does not require sudo, uses the following three commands to mount a volume: truncate -s 100M /tmp/mydevice.img mkfs.ext4 /tmp/mydevice.img fuse2fs /tmp/mydevice.img test/ Additionally, a proper cleanup is performed, i.e. servers are stopped gracefully and volumes are unmounted after the tests using them are completed. Fixes: https://github.com/scylladb/scylladb/issues/25906 Closes scylladb/scylladb#26065	2025-09-25 11:05:50 +03:00
Wojciech Przytuła	e5e913ab8e	Update CPP-RS Driver's link to documentation As the proper documentation of CPP-RS Driver is already there, let's update the links to point to it instead of the GitHub repo. Closes scylladb/scylladb#26089	2025-09-25 11:05:50 +03:00
Marcin Maliszkiewicz	2c6e1402af	auth: document setting _superuser_created_promise flow in auth-v1	2025-09-25 10:05:39 +02:00
Evgeniy Naydanov	eea166c809	test.py: dtest: make cfid_test.py run using test.py As a part of the porting process remove unused markers. Explicitly enable auto snapshots for the test, as they are required for it. Enable the test in suite.yaml (run in dev mode only)	2025-09-25 11:04:00 +03:00
Evgeniy Naydanov	18723b41cf	test.py: dtest: copy unmodified cfid_test.py	2025-09-25 10:33:18 +03:00
Łukasz Paszkowski	29de947851	test_out_of_space_prevention.py: Fix flaky test_user_writes_rejection test The test starts a 3-node cluster and immediately creates a big file on one of the nodes, to trigger the out of space prevention to start rejecting writes on this node. Then a write is executed and it is checked that it did not reach the node with critical disk utilization but reached the remaining nodes (it should, RF=3 is set). However, when not specified, a default LOCAL_ONE consistency level is used. This means that only one node is required to acknowledge the write. After the write, the test checks if the write + did NOT reach the node with critical disk utilization (works) + did reach the remaining nodes This can cause the test to fail sporadically as the write might not yet be on the last node. Use CL=QUORUM instead. Fixes: https://github.com/scylladb/scylladb/issues/26004 Closes scylladb/scylladb#26030	2025-09-25 08:05:45 +03:00
Lakshmi Narayanan Sreethar	5f6e1edd93	configure.py: fix concatenation of user linker flags `configure.py` incorrectly concatenates the user linker flags and internal linker flags without a space in between causing flags like '-gz' and '-fuse-ld=lld' to merge into a single invalid argument. Fix that by using a proper space when concatenating these two flags. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-24 20:56:59 +05:30
Lakshmi Narayanan Sreethar	0c23d4f22d	configure.py: fix passing of user linker flags to cmake Any user provided linker flags are converted to a semicolon separated string by the `configure.py` script and then passed to cmake via the `CMAKE_EXE_LINKER_FLAGS` option. But `CMAKE_EXE_LINKER_FLAGS` expects semicolon separated values only when set from within CMakeLists. When the option is set from the shell, which is the case with `configure.py` executing cmake, the values should be separated by space. So, pass the flags as it is instead of separating them with semicolons. Fixes #26232 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-24 20:40:13 +05:30
Piotr Dulikowski	5d5244abaf	Merge 'vector_store_client: Add support for multiple IPs in DNS responses' from Karol Nowacki vector_store_client: Add support for multiple IPs in DNS responses The DNS resolution logic now processes all IP addresses returned in a DNS response, not just the primary one. The client will iterate through the list of resolved IPs, attempting to query the next one if a request fails. This improves high availability by allowing the client to query other available nodes if one is down. References: VECTOR-187 As this is a new feature no backport is needed. Closes scylladb/scylladb#26055 * github.com:scylladb/scylladb: vector_store_client: Rename HTTP_REQUEST_RETRIES to ANN_RETRIES vector_store_client: Format with clang-format vector_store_client: Add support for multiple IPs in DNS responses vector_store_client_test: Extract `make_vs_server` helper function vector_store_client_test: Ensure cleanup on exception vector_store_client_test: Fix unreliable unavailable port tests	2025-09-24 16:24:19 +02:00
Ferenc Szili	c6c9c316a7	load_balancer: fix std::out_of_bounds when decommissioning with empty nodes Consider the following: The tablet load balancer is working on: - node1: an empty node (no tablets) with a large disk capacity - node2: an empty node (no tablets) with a lower disk capacity than node1 - node3: is being decommissioned and contains tablet replicas In load_balancer::make_internode_plan() the initial destination node/shard is selected like this: // Pick best target shard. auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)}; load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which adds the host to load_sketch's internal hash maps in case the node was not yet seen by load_sketch. Let's assume dst is a shard on node1. Later in load_balancer::make_internode_plan() we will call pick_candidate() to try to find a better destination node than the initial one: // May choose a different source shard than src.shard or different destination host/shard than dst. auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst, drain_skipped); auto source_tablets = candidate.tablets; src = candidate.src; dst = candidate.dst; If pick_candidate() selects some other empty destination (due to larger capacity: node1) node, and that node has not yet been seen by load_sketch (because it was empty), a subsequent call to load_sketch::pick() will search for the node using std::unordered_map::at(), and because the node is not found it will throw a std::out_of_bounds() exception crashing the load balancer. This problem is fixed by changing load_sketch::populate() to initialize its internal maps with all the nodes which populate()'s arguments filter for. Fixes: #26203 Closes scylladb/scylladb#26207	2025-09-24 15:27:19 +02:00
Pavel Emelyanov	b85673e9b0	test,lib: Add range_to_endpoint_map() method to rest client Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:44:57 +03:00
Pavel Emelyanov	5746e61a60	api: Indentation fix after previous patches Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:44:50 +03:00
Pavel Emelyanov	1649808429	storage_service: Get tablet tokens if e.r.m. is per-table Getting all token metadata tokens is not correct, the resulting map will be over-populated. Compare this with effective_ownership() method -- it also gets different tokens for vnodes and tablets cases. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:42:59 +03:00
Pavel Emelyanov	bac9f200b3	storage_service,api: Get e.r.m. inside get_range_to_address_map() Now it's the caller (API handler) that gets e.r.m. from keyspace or table, and this patch moves this selection into the callee. This is a preparatory change. Next patch will need to pass optional table_id to get_range_to_address_map(), and to make this table_id presence consistent with whether e.r.m. is per table or not, it's simpler to move e.r.m. evaluation into the latter method as well. (indentation in API handler is deliberately left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:40:14 +03:00
Pavel Emelyanov	0c258187d9	storage_service: Calculate tokens on stack And std::move() it into the callee. No functional changes here, it's here to reduce churn in the next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:39:23 +03:00
Avi Kivity	fb5664a1d5	interval: split interval_bound implementation for const references In `f3dccc2215` ("interval: change start()/end() not to return references to data members"), we introduced interval_bound_const_ref as a lightweight alternative to interval_bound that does not carry a T. This was needed because interval no longer contains interval_bound:s. This interval_bound_const_ref was just an interval_bound<const T&>, and converting constructors and operators were added to move between the interval_bound<T> and interval_bound<const T&>. However, these happen to be illegal in C++ and just happened to work in clang 20. Clang 21 tightened its checks and these are now flagged. The problem is that when instantiating interval_bound<const T&> the converting constructor looks like a copy constructor; and while it's behind a constraint (that evaluates to false) the rules don't care about that. Fix this by having a separate interval_bound_const_ref template. The new template is slightly better as it allows assignment (since the payload is a pointer rather than a reference). Not that it's really needed. The C++ rule was reported [1] as too restrictive, but there is no resolution yet. [1] https://cplusplus.github.io/CWG/issues/2837.html Closes scylladb/scylladb#26081	2025-09-24 13:57:21 +02:00
Alex Dathskovsky	5e89a78c8f	raft: refactor can_vote logic and type This PR refactors the can_vote function in the Raft algorithms for improved clarity and maintainability by providing safer strong boolean types to the raft algorithm. Fixes: #21937 Backport: No backport required Closes scylladb/scylladb#25787	2025-09-24 13:55:05 +02:00
Nikos Dragazis	a7e46974d4	db/config: Prepare compression_parameters for config system SSTable compression is currently configurable only per table, via the `compression` property in CREATE/ALTER TABLE statements. This is represented internally via the `compression_parameters` class. We plan to offer the same options via the configuration as well, to make the default compression method for user tables configurable. This patch prepares the ground by making the `compression_parameters` usable as a `config_file::named_value`, namely: * Define an extraction operator (required by `boost::program_options` for parsing the options from command line). * Define a formatter (required by `named_value::operator()`). * Define a template specialization for `config_type_for` (required by `named_value` constructor). * Define a yaml converter (required for parsing the options from scylla.yaml). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-24 14:51:39 +03:00
Nikos Dragazis	ea41f652c4	compressor: Validate presence of sstable_compression in parameters SSTable compression parameters should always define an algorithm via the `sstable_compression` sub-option. Add a check in the constructor to ensure this is always provided (unless no options are given, which is interpreted as "no compression"). This change has no user-visible effect, since the same check is already performed at a higher-level, while validating the CQL properties of CREATE TABLE and ALTER TABLE statements (see `cf_prop_defs::validate()`). However, it will become useful in later patches, when compression config options will be introduced. Although now redundant, keep the sanity check in `cf_prop_defs::validate()` to maintain consistency of error messages with Cassandra. Note also that Cassandra uses 'class' instead of 'sstable_compression' since version 3.11.10, but Scylla still doesn't support this, see: https://github.com/scylladb/scylladb/issues/4200 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-24 14:50:04 +03:00
Botond Dénes	761a32927e	Merge 'scrub: Handle malformed_sstable_exception in scrub skip mode' from Taras Veretilnyk This PR improves the handling of malformed SSTables during scrub and adds tests to validate the updated behavior. When scrub is used, there is an increased chance of encountering malformed SSTables. These should not be retried as in regular compaction. Instead, they must be handled according to the selected scrub mode: in skip mode, in case of malformed_sstable_exception, invalid data or whole SSTable should be removed, in abort and segregate modes, the scrub process should abort. Previously, all modes treated malformed_sstable_exception the same way, causing scrub to abort even when skip mode was selected. This PR updates the scrub logic to properly handle malformed SSTable exceptions based on the selected mode. Unit tests are added to verify the intended behavior. Fixes scylladb/scylladb#19059 Backport is not required, it is an improvement Closes scylladb/scylladb#25828 * github.com:scylladb/scylladb: sstable_compaction_test: add scrub tests for malformed SSTables scrub: skip sstable on malformed sstable exception in skip mode	2025-09-24 14:28:43 +03:00
Avi Kivity	b67928a457	build: suppress linker pgo hash mismatch warnings Since we generate pgo profiles once a fortnight (and not every build), pgo hash mismatches are expected as the code built diverges from the code measured. The warnings about hash mismatches don't provide any value (they cannot be acted upon) and are only distracting. A possible downside is that we'll miss the pgo training process failing (it is visible in the warnings list getting longer and longer), but that's not a proper indication. Suppress them with the appropriate linker switch. Ref #26010. Closes scylladb/scylladb#26162	2025-09-24 13:07:07 +02:00
Ernest Zaslavsky	5ba5aec1f8	treewide: Move mutation related files to a `mutation` directory As requested in #22104, moved the files and fixed other includes and build system. Moved files: - combine.hh - collection_mutation.hh - collection_mutation.cc - converting_mutation_partition_applier.hh - converting_mutation_partition_applier.cc - counters.hh - counters.cc - timestamp.hh Fixes: #22104 This is a cleanup, no need to backport Closes scylladb/scylladb#25085	2025-09-24 13:23:38 +03:00
Artsiom Mishuta	1493fd1521	test.py: adjust framework verbosity default settings Currently, test.py always runs in verbose mode in non-TTY due to the custom reporter TabularConsoleOutput (custom-implemented test reporter in test.py) limitations, because in non-verbose mode, it "Use ANSI Escape sequences for manipulating console," which is not possible in non-TTY. So, it also affects the new pure pytest runner (now only boost tests run under it), but it should not, since pytest standard output does not depend on TTY. This commit changes the logic to increase only TabularConsoleOutput verbosity for CI (non-TTY) runs instead of the whole framework. Closes scylladb/scylladb#26202	2025-09-24 12:05:14 +02:00
Avi Kivity	0cfb78ed38	build: drop -fvisibility=hidden compiler flag The -fvisibility=hidden flag removes a shared library's symbols from the dynamic symbol table, reducing the shared object size and the dynamic linking time. In return, the user promises not to rely on the uniqueness of objects declared with exactly the same name in the shared library and the main executable. However, we violate this assumption. In Seastar's noncopyable_function, we compare _vtable to _s_empty_vtable (in operator bool()). The full name of this _s_empty_vtable is seastar::noncopyable_function<Signature>::_s_empty_vtable. Since it can be instantiated in both the main executable and in libseastar.so, the comparison can fail even though we're comparing what is, in C++ view, the address of a unique object to itself. To solve the problem, we can either: - reimplement noncopyable_function::operator bool in a way that does not depend on the object name (for example, 'return _vtable->empty;' where empty is a boolean initialized to true for the empty vtable), and be careful not to repeat the mistake elsewhere - drop -fvisibility=hidden Here, we choose the second option. The benefit of -fvisibility=hidden is important, but we only use shared libraries in debug mode, and the time spent chasing these apparent one-definition-rule violations is more important than some milliseconds during launch time and a few megabytes in libseastar.so. We do trade it in for -fvisibility-inlines-hidden. This is similar to -fvisibility=hidden, but applies only to addresses of inline functions. Since comparing functions by address is very rare, and inline functions are very common, it seems like a reasonable tradeoff to make. Fixes #25286. Closes scylladb/scylladb#26154	2025-09-24 12:32:35 +03:00
Botond Dénes	1ac7b4c35e	treewide: move away from accessing httpd::request::query_parameters Accessing this member directly is deprecated, migrate code to use {get,set}_query_param() and friends instead. Fixes: https://github.com/scylladb/scylladb/issues/26023	2025-09-24 11:52:15 +03:00
Botond Dénes	5891aeb1fb	test/pylib/s3_server_mock.py: better handle empty query params Instead of re-inventing empty param handling, use the built-in keep_blank_values=True param of the urllib.parse.parse_qs(). Handles correctly the case where the `=` is also present but no value follows, this is the syntax used by the new query_params in seastar::http::request. Also add an exception to build_POST_response(). Better than a cryptic message about encode() not callable on NoneType.	2025-09-24 11:52:15 +03:00
Karol Nowacki	706eeee1bd	vector_store_client: Rename HTTP_REQUEST_RETRIES to ANN_RETRIES Rename `HTTP_REQUEST_RETRIES` to `ANN_RETRIES` in `vector_store_client`, as it now applies to all vector store nodes, not just HTTP requests. Also, remove an unused test setter function.	2025-09-24 10:51:43 +02:00
Karol Nowacki	8411a03f22	vector_store_client: Format with clang-format	2025-09-24 10:41:37 +02:00
Karol Nowacki	57d1b601a8	vector_store_client: Add support for multiple IPs in DNS responses The DNS resolution logic now processes all IP addresses returned in a DNS response, not just the primary one. The client will iterate through the list of resolved IPs, attempting to query the next one if a request fails. This improves high availability by allowing the client to query other available nodes if one is down.	2025-09-24 10:41:37 +02:00
Karol Nowacki	cc616252a4	vector_store_client_test: Extract `make_vs_server` helper function The `make_vs_server` function is refactored into a standalone helper to allow its reuse in upcoming test cases.	2025-09-24 10:41:37 +02:00
Karol Nowacki	6da598fa4a	vector_store_client_test: Ensure cleanup on exception Move the mock/test server shutdown into a `finally()` block to guarantee cleanup even if the test case throws an exception.	2025-09-24 10:41:37 +02:00
Karol Nowacki	381586f1b8	vector_store_client_test: Fix unreliable unavailable port tests The `generate_unavailable_localhost_port` function is not robust because it can suffer from a race condition. It finds an available port but does not keep it occupied, meaning another process could bind to it before the test can use it. The `unavailable_server` helper is a more robust solution. It creates a server that listens on a port for its entire lifetime and immediately closes any incoming connections. This guarantees the port remains unavailable, making the test more reliable.	2025-09-24 10:23:24 +02:00
Piotr Dulikowski	bfb8e807be	Merge 'streaming/stream_blob: generate view updates from staging sstables' from Michał Jadwiszczak After https://github.com/scylladb/scylladb/pull/22034, staging status of sstables streamed via file streaming was ignored and view updates were never generated. This patch fixes it and now staging sstables are registered to `view_building_worker`. Then, the worker creates view building tasks for those sstables, so the view building coordinator can schedule them once the tablet migration is finished. Fixes https://github.com/scylladb/scylla-enterprise/issues/4572 This fix affects only views on tablets, which are still experimental, so no backport needed. Closes scylladb/scylladb#25776 * github.com:scylladb/scylladb: test/test_view_building_coordinator: add reproducer for file streaming streaming/stream_blob: register staging sstables to process them	2025-09-24 09:15:33 +02:00
Lakshmi Narayanan Sreethar	f2308a2ce5	compaction/twcs: use on_internal_error for invalid timestamp resolution When the `time_window_compaction_strategy::to_timestamp_type` encounters an invalid timestamp resolution it throws an `std::runtime_error`. This patch replaces it with `on_internal_error()` and also logs the invalid timestamp resolution for easier debugging. Refs #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26138	2025-09-24 09:14:16 +03:00
Wojciech Mitros	eb92f50413	hinted_handoff: drain hints after the target node stops owning tokens When a node is being replaced, it enters a "left" state while still owning tokens. Before this patch, this is also the time when we start draining hints targeted to this node, so the hints may get sent before the token ownership gets migrated to another replica, and these hints may get lost. In this patch we postpone the hint draining for the "left" nodes to the time when we know that the target nodes no longer hold ownership of any tokens - so they're no longer referenced in topology. I'm calling such nodes "released". Before this change, when we were starting draining hints, we knew the IP addresses of the target nodes. We lose this information after entering the "left" stage, so when draining hints after a node is "released", we can't drain the hints targeted to a specific IP instead of host_id. We may have hints targeted to IPs if the migration from IP-based to host_ID-based hints didn't happen yet. The migration happens when enabling a cluster feature since 2024.2.0, so such hints can only exist if we perform a direct upgrade from a version before 2024.2.0 to a version that has this change (2025.4.0+). To avoid losing hints completely when such an upgrade is combined with a node removal/replacement, we still drain hints when the node enters a "left" state and the migration of hints to host_id wasn't performed yet. For these drains, the problematic scenario can't occur because it only affects tablets, and when upgrading from a version before 2024.2.0, no tablets can exist yet. If we perform such a drain, we no longer need to drain hints when entering the "released" state, so we only drain when entering that state if the migration was already completed. With this setup, we'll always drain hints at least once when a node is leaving.
However, if the migration to host_ids finishes between entering the "left" state and the "released" state, we'll attempt to drain the hints twice. This shouldn't be a problem though because each `drain_for()` is performed with the `_drain_lock` and after a `hint_endpoint_manager` is drained, it's removed, so we won't try to drain it twice. This patch also includes a test for verifying that hints are properly replayed after a node replace. Fixes https://github.com/scylladb/scylladb/issues/24980 Closes scylladb/scylladb#24981	2025-09-24 07:11:59 +02:00
Michał Chojnowski	b76716c8aa	tools/schema_loader: disable tablet-related restrictions in the placeholder keyspace Passing `0` as the `initial_tablets` argument causes `schema_loader`'s placeholder keyspace to be a tablet keyspace. This causes `scylla sstable` to reject some table schemas which are legitimate in this context. For example, `scylla sstable` refuses to work with sstables which contain `counter` columns, because tablets don't support counters. This is undesirable. Let's make `schema_loader`'s keyspace a non-tablet keyspace. Closes scylladb/scylladb#26192	2025-09-24 06:55:28 +03:00
Botond Dénes	7c6fb131f3	Merge 'compaction: ensure that all compaction executors are stopped' from Aleksandra Martyniuk Currently, while stopping the compaction_manager, we stop task_manager compaction module and concurrently run compaction_manager::really_do_stop. really_do_stop stops and waits for all task_executors that are kept in compaction_manager::_tasks, but nothing ensures that no more tasks will be added there. Due to leftover tasks, we trigger on_fatal_internal_error. Modify the order of compaction_manager::stop. After the change, we stop compaction tasks in the following order: - abort module abort source; - close module gate in the background; - stop_ongoing_compactions (kept in compaction_manager::_tasks); - wait until module gate is closed. Check module abort source before creating compaction executor and adding it to _tasks. Thanks to the above, we can be sure that: - after module::stop there will be no tasks in _tasks; - compaction_manager::stop aborts all tasks; we don't wait for any whole compaction to finish. Fixes: https://github.com/scylladb/scylladb/issues/25806. Fixes shutdown bug; needs backports to all versions Closes scylladb/scylladb#25885 * github.com:scylladb/scylladb: compaction: move _tasks check compaction: stop compaction module in really_do_stop	2025-09-24 06:49:52 +03:00
Lakshmi Narayanan Sreethar	82c95699ea	types/comparable_bytes: add compatibility test data for DateType Byte comparable encoding for DateType was introduced in `bf90018b8e`. This PR updates the compatibility test data to include the type in the test coverage. Refs #19407 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26208	2025-09-24 06:42:24 +03:00
Aleksandra Martyniuk	48bbe09c8b	test: fix test_two_tablets_concurrent_repair_and_migration_repair_writer_level test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits for the first node that logs info about repair_writer using asyncio.wait. The done group is never awaited, so we never learn about the error. The test itself is incorrect and the log about repair_writer is never printed. We never learn about that, and the test finishes successfully after a 10-minute timeout. Fix the test: - disable hinted handoff; - repair tablets of the whole table: - new table is added so that concurrent migration is possible; - use wait_for_first_completed that awaits done group; - do some cleanups. Remove nightly mark. Fixes: #26148. Closes scylladb/scylladb#26209	2025-09-24 06:40:45 +03:00
Avi Kivity	2239474a87	Merge 'tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true' from Tomasz Grabiec Greatly improves performance of plan making, because we don't consider candidates in other racks, most of which will fail to be selected due to replication constraints (no rack overload). Also (but minor) reduces the overhead of candidate evaluation, as we don't have to evaluate rack load. Enabled only for rf_rack_valid_keyspaces because such setups guarantee that we will not need (because we must not) move tablets across racks, and we don't need to execute the general algorithm for the whole DC. Tested with perf-load-balancing, which performs a single scale-out operation on a cluster which initially has 10 nodes 88 shards each, 2 racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new nodes (same shard count). Time to rebalance the cluster (plan making only, sum of all iterations, no streaming): Before: 16 min 25 s After: 0 min 25 s Before, plan making cost (single incremental iteration) alternated between fast (0.1 [s]) and slow (14.1 [s]): testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0) testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0) The slow run chose min and max nodes in different racks, hence the fast path failed to find any candidates and we switched to exhaustive search of candidates in other nodes. After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls. 
Fixes #26016 Closes scylladb/scylladb#26017 * github.com:scylladb/scylladb: test: perf: perf-load-balancing: Add parallel-scaleout scenario test: perf: perf-load-balancing: Convert to tool_app_template tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true	2025-09-23 22:45:35 +03:00
Tomasz Grabiec	981592bca5	tablet: scheduler: Do not emit conflicting migrations in the plan Plan-making is invoked independently for different DCs (and in the future, racks) and then plans are merged. It could be that the same tablets are selected for migration in different DCs. Only one migration will prevail and be committed to group0, so it's not a correctness problem. Next cycle will recognize that the tablet is in transition and will not be selected by plan-maker. But it makes plan-making less efficient. It may also surprise consumers of the plan, like we saw in #25912. So we should make plan-maker be aware of already scheduled transitions and not consider those tablets as candidates. Fixes #26038 Closes scylladb/scylladb#26048	2025-09-23 22:40:08 +03:00
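The idea in the entry above, making the plan-maker skip tablets that already have a scheduled transition, can be illustrated with a minimal sketch. Everything here is hypothetical and simplified for illustration: `tablet_id`, `migration`, and `filter_candidates` are not taken from the ScyllaDB sources, and the real scheduler works with much richer types.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical, simplified stand-ins for the scheduler's types.
using tablet_id = unsigned;

struct migration {
    tablet_id tablet;
    unsigned src_node;
    unsigned dst_node;
};

// Returns the candidates that are not already part of a scheduled migration,
// so a later plan (e.g. for another DC) cannot pick the same tablet again.
inline std::vector<tablet_id> filter_candidates(const std::vector<tablet_id>& candidates,
                                                const std::vector<migration>& scheduled) {
    std::set<tablet_id> in_transition;
    for (const auto& m : scheduled) {
        in_transition.insert(m.tablet);
    }
    std::vector<tablet_id> out;
    for (tablet_id t : candidates) {
        if (in_transition.count(t) == 0) {
            out.push_back(t);
        }
    }
    return out;
}
```

With one migration already scheduled for tablet 7, calling `filter_candidates({5, 7, 9}, scheduled)` leaves tablets 5 and 9 as candidates.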
Nikos Dragazis	1106157756	compressor: Add missing space in exception message Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-23 21:04:22 +03:00
Łukasz Paszkowski	5089ffe06f	tools: toolchain: add e2fsprogs, fuse3 to the dependencies The packages contain filesystem utilities to create volumes such that sudo/unshare are not required. Closes #26135 [avi: regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes scylladb/scylladb#26165	2025-09-23 18:49:37 +03:00
Botond Dénes	f9172c934a	Merge 'comparable_bytes: handle counter type' from Lakshmi Narayanan Sreethar Byte comparable format is not supported for counter types. This patch adds explicit handling for them for completeness, allowing the default abstract type handler to be removed. Refs #19407 New features. No need to backport. Closes scylladb/scylladb#26206 * github.com:scylladb/scylladb: types/comparable_bytes: remove default abstract type handler types/comparable_bytes: handle counter type	2025-09-23 18:23:49 +03:00
Pavel Emelyanov	f6e8a14fb0	Update seastar submodule Includes fix for scylla-gdb -- the fair_queue is now hierarchy and priority queues are no longer accessible via _handles member. However, the fix is incomplete -- it silently assumes that the hierarchy is flat (and it _is_ flat now, scylla doesn't yet create nested groups) but it should be handled eventually * seastar b6be384e...c8a3515f (8): > Merge 'Nested scheduling groups (IO classes)' from Pavel Emelyanov test: Add test case for wake-from-idle accumulator fixups test: Add fair_queue test for queues activations test: Expand fair queue random run test with groups test: Add test for basic fair-queue nested linkage test: Cleanup scheduling groups after io_queue_test cases code: Update IO priority group shares from supergroup shares change io_queue: Register class groups in fair-queue fair_queue: Add test class friendship fair_queue: Move nr_classes into group_data fair_queue: Fix indentation after previous patch fair_queue: Implement hierarchical queue activation (wakeup) fair_queue: Remove now unused push/pop helpers fair_queue: Implement hierarchical priority_entry::pop_front() fair_queue: Implement hierarchical priority_entry::top() fair_queue: Link priority_entries into tree fair_queue: Add priority_class_group_data::reserve() fair_queue: Move more bits onto priority_entry fair_queue: Move shares on priority_entry fair_queue: Move last_accumulated on priority_class_group_data fair_queue: Introduce priority_class_group_data fair_queue: Inherit priority_class_data from priority_entry fair_queue: Rename priority_class_ptr ioqueue: Opencode get_class_info() helper ioq: Move fair_queue class registration down > tls: Rework session termination > http/request: get_query_param(): add default_value argument > http/request: add has_query_param() > sharded: Deprecate distributed alias > io_tester: Allow configuring sloppy_size_hint for files > file: Remove duplicating static_assert-ions Signed-off-by: Pavel Emelyanov 
<xemul@scylladb.com> Closes scylladb/scylladb#26143	2025-09-23 18:20:45 +03:00
Michał Jadwiszczak	73ce19939e	test/test_view_building_coordinator: add reproducer for file streaming The test reproduces issue scylladb/scylla-enterprise#4572. Before the fix, file-streaming didn't respect staging status of a sstable and view updates weren't generated, leading to base-view inconsistency. The test creates the inconsistency in the view, triggers file-streaming of staging sstables and verifies that the updates are generated.	2025-09-23 15:34:42 +02:00
Michał Jadwiszczak	0f3827d509	streaming/stream_blob: register staging sstables to process them After scylladb/scylladb#22034, staging status of sstables streamed via file streaming was ignored and view updates were never generated. This patch fixes it and now staging sstables are registered to `view_building_worker`. Then, the worker creates view building tasks for those sstables, so the view building coordinator can schedule them once the tablet migration is finished. Fixes scylladb/scylla-enterprise#4572	2025-09-23 15:34:42 +02:00
Patryk Jędrzejczak	da44d6af09	Merge 'Move some compaction manager API handlers from storage_service.cc to tasks.cc' from Pavel Emelyanov There's a bunch of /storage_service/... endpoints that start compaction manager tasks and wait for it. Most of them have async peer in /tasks/... that start the very same task, but return to the caller with the task ID. This patch moves those handlers' code from storage_service.cc to tasks.cc, next to the corresponding async peers, to keep handlers that need compaction_manager in one place. That's preparation for more future changes. Later all those endpoints will stop using database from http_context and will capture the compaction_manager they need from main, like it was done in #20962 for /compaction_manager/... endpoints. Even "more later", the former and the latter blocks of endpoints will be registered and unregistered together, e.g. like database endpoints were collected in one reg/unreg sequence by #25674. Part of http_context dependencies cleanup effort, no need to backport. Closes scylladb/scylladb#26140 * https://github.com/scylladb/scylladb: api: Move /storage_service/compact to tasks.cc api: Move /storage_service/keyspace_upgrade_sstables to tasks.cc api: Move /storage_service/keyspace_offstrategy_compaction to tasks.cc api: Move /storage_service/keyspace_cleanup to tasks.cc api: Move /storage_service/keyspace_compaction to tasks.cc	2025-09-23 15:08:48 +02:00
Taras Veretilnyk	b81caf3f54	sstable_compaction_test: add scrub tests for malformed SSTables Add unit tests for scrub behavior with malformed SSTables: - sstable_scrub_abort_mode_malformed_sstable_test (verifies scrub aborts on malformed SSTables) - sstable_scrub_skip_mode_malformed_sstable_test (verifies scrub skips malformed SSTables without aborting)	2025-09-23 14:34:09 +02:00
Taras Veretilnyk	796bdfc9f7	scrub: skip sstable on malformed sstable exception in skip mode Previously, malformed_sstable_exception during scrub was handled the same way in abort, skip and segregate mode, causing the scrub process to abort even when skip was specified. This commit updates the behavior to correctly handle malformed_sstable_exception in skip mode by removing invalid data or the whole malformed SSTable instead of aborting the entire scrub.	2025-09-23 14:34:09 +02:00
Aleksandra Martyniuk	97c77d7cd5	compaction: move _tasks check In compaction_manager::really_do_stop we check whether _tasks list is empty after the compactions are stopped. However, a new task may still sneak in, causing the assertion failure. Such a task won't be there for long - module::make_task will fail as the module is already stopped. Move the assertion, that checks if _tasks is empty, after the compaction_states' gates are closed. Fixes: #25806.	2025-09-23 14:22:19 +02:00
Aleksandra Martyniuk	17707d0e6b	compaction: stop compaction module in really_do_stop Currently, compaction::task_manager_module is stopped in compaction_manager::stop, concurrently to really_do_stop. We can't predict the order of the two. Do not set _task_manager_module to nullptr at stop, because compaction_manager::really_do_stop() may be called before the actual shutdown, while other components still try to use it. compaction::task_manager_module does not keep a pointer to compaction_manager, so we won't end up with memory leak. Stop compaction module in really_do_stop, after ongoing compactions are stopped. It's a preparation for further patches.	2025-09-23 14:21:15 +02:00
Lakshmi Narayanan Sreethar	0914978605	types/comparable_bytes: remove default abstract type handler Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-23 13:44:59 +05:30
Lakshmi Narayanan Sreethar	ee1e648a7f	types/comparable_bytes: handle counter type Byte comparable format is not supported for counter types. This patch adds explicit handling for them for completeness, allowing the default abstract type handler to be removed in the next patch. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-23 13:44:59 +05:30
Andrzej Jackowski	c8f45dbbb2	test: speed up test_long_query_timeout_erm `test_long_query_timeout_erm` is slow because it has many parameterized variants, and it verifies timeout behavior during ERM operations, which are slow by nature. This change speeds the test up by roughly 3× (319s -> 114s) by: - Removing two of the five scenarios that were near duplicates. - Shortening timeout values to reduce waiting time. - Parallelizing waiting on server_log with asyncio.TaskGroup(). The two removed scenarios (`("SELECT", True, False)`, `("SELECT_WHERE", True, False)`) were near duplicates of the `("SELECT_COUNT_WHERE", True, False)` scenario, because all three scenarios use a non-mapreduce query and trigger basically the same system behavior. It is sufficient to keep only one of them, so the test verifies three cases: - One with nodes shutdown - One with mapreduce query - One with non-mapreduce query Fixes: scylladb/scylla#24127 Closes scylladb/scylladb#25987	2025-09-23 10:28:07 +03:00
Piotr Dulikowski	482ddfb3b4	Merge 'mv: handle mismatched base/view replica count caused by RF change' from Wojciech Mitros During an ALTER KEYSPACE statement execution where a table with a view is present, we need to perform tablet migrations for both tables. These migrations are not synchronized, so at some point the base may have a different number of non-pending replicas than the view. Because of that, we can't pair them correctly. If there are more non-pending base replicas than view replicas, we don't need to do anything because the view replica that didn't finish migrating is a pending replica and will get view updates from all base replicas. But if there are more non-pending view replicas than base replicas, we may currently lose view updates to the new view replica. This patch adds a workaround for this scenario. If after one migration we have more non-pending view replicas than base replicas, we add the extra view replica to the pending replica list so that it gets an update anyway. This patch will also take effect if the base and view replica counts differ due to some other bug. To track that, a new metric is added to count such occurrences. This patch also includes a test for this exact scenario, which is enforced by an injection. Fixes https://github.com/scylladb/scylladb/issues/21492 Closes scylladb/scylladb#24396 * github.com:scylladb/scylladb: mv: handle mismatched base/view replica count caused by RF change mv: save the nodes used for pairing calculations for later reuse mv: move the decision about simple rack-aware pairing later	2025-09-23 08:10:08 +02:00
Dawid Mędrek	35f7d2aec6	db/batchlog: Drop batch if table has been dropped If there are pending mutations in the batchlog for a table that has been dropped, we'll keep attempting to replay them but with no success -- `db::no_such_column_family` exceptions will be thrown, and we'll keep trying again and again. To prevent that, we drop the batch in that case just like we do in the case of a non-existing keyspace. A reproducer test has been included in the commit. It fails without the changes in `db/batchlog_manager.cc`, and it succeeds with them. Fixes scylladb/scylladb#24806 Closes scylladb/scylladb#26057	2025-09-23 07:48:59 +02:00
Tomasz Grabiec	2b03a69065	test: perf: perf-load-balancing: Add parallel-scaleout scenario Simulates rebalancing on a single scale-out involving simultaneous addition of multiple nodes per rack. Default parameters create a cluster with 2 racks, 70 tables, 256 tablets/table, 10 nodes, 88 shards/node. Adds 6 nodes in parallel (3 per rack). Current result on my laptop: testlog - Rebalance took 21.874 [s] after 82 iteration(s)	2025-09-23 00:31:31 +02:00
Tomasz Grabiec	0dcaaa061e	test: perf: perf-load-balancing: Convert to tool_app_template To support sub-commands for testing different scenarios. The current scenario is given the name "rolling-add-dec".	2025-09-23 00:30:38 +02:00
Tomasz Grabiec	c9f0a9d0eb	tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true Greatly improves performance of plan making, because we don't consider candidates in other racks, most of which will fail to be selected due to replication constraints (no rack overload). Also (but minor) reduces the overhead of candidate evaluation, as we don't have to evaluate rack load. Enabled only for rf_rack_valid_keyspaces because such setups guarantee that we will not need (because we must not) move tablets across racks, and we don't need to execute the general algorithm for the whole DC. Tested with perf-load-balancing, which performs a single scale-out operation on a cluster which initially has 10 nodes 88 shards each, 2 racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new nodes (same shard count). Time to rebalance the cluster (plan making only, sum of all iterations, no streaming): Before: 16 min 25 s After: 0 min 25 s Before, plan making cost (single incremental iteration) alternated between fast (0.1 [s]) and slow (14.1 [s]): Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0) Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0) The slow run chose min and max nodes in different racks, hence the fast path failed to find any candidates and we switched to exhaustive search of candidates in other nodes. After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls. Fixes #26016	2025-09-23 00:30:37 +02:00
Patryk Jędrzejczak	a56115f77b	test: deflake driver reconnections in the recovery procedure tests All three tests could hit https://github.com/scylladb/python-driver/issues/295. We use the standard workaround for this issue: reconnecting the driver after the rolling restart, and before sending any requests to local tables (that can fail if the driver closes a connection to the node that restarted last). All three tests perform two rolling restarts, but the latter ones already have the workaround. Fixes #26005 Closes scylladb/scylladb#26056	2025-09-22 17:21:06 +02:00
Pavel Emelyanov	bc72a637bd	sstables: Restore indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-22 17:49:11 +03:00
Pavel Emelyanov	d97d595827	sstables: Coroutinize filesystem_storage::check_create_links_replay() Inner lambda only. The outer is a single parallel_for_each, probably not worth it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-22 17:48:42 +03:00
Andrzej Jackowski	15e71ee083	test: audit: stop using datetime.datetime.now() in syslog converter `line_to_row` is a test function that converts `syslog` audit log to the format of `table` audit log so tests can use the same checks for both types of audit. Because `syslog` audit doesn't have `date` information, the field was filled with the current date. This behavior broke the tests running at 23:59:59 because `line_to_row` returned different results on different days. Fixes: scylladb/scylladb#25509 Closes scylladb/scylladb#26101	2025-09-22 15:31:33 +03:00
Pavel Emelyanov	b23aab882a	Merge 'test/alternator: multiple fixes for tests so they would pass on DynamoDB' from Nadav Har'El Issue #26079 noted that multiple Alternator tests are failing when run against DynamoDB. This pull request fixes many of them, in several small patches. In one case we need to avoid a DynamoDB bug that wasn't even the point of the original test (and we create a new test specifically for that DynamoDB bug). Another test exposed a real incompatibility with Alternator (#26103) but didn't need to be exposed in this specific test so again we split the test to one that passes, and another one that xfails on Alternator (not on DynamoDB). A bigger change had to be made to the tags feature test - since August 2024, the TagResource operation became asynchronous which broke our tests, so we fix this. Each of these changes is described in more detail in the individual patches. Refs #26079. It doesn't fix it completely because there are some tests which remain flaky, and some tests which, surprisingly, pass on us-east-1 but fail on eu-north-1. We'll need to address the rest later. No backports needed, we only run tests against DynamoDB from master (when we rarely do...), not on old branches. Closes scylladb/scylladb#26114 * github.com:scylladb/scylladb: test/alternator: fix test_list_tables_paginated on DynamoDB test/alternator: fix tests in test_tag.py on DynamoDB test/alternator: fix test_health_only_works_for_root_path on DynamoDB test/alternator: reproducer tests for faux GSI range key problem test/alternator: fix test "test_17119a" to pass on DynamoDB test/alternator: fix test to pass on DynamoDB	2025-09-22 15:30:40 +03:00
Pavel Emelyanov	f6860d1de0	Merge 'mv: run view building worker fibers in streaming group' from Piotr Dulikowski The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong. Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit. No need to backport, view build coordinator is not a part of any release yet. Closes scylladb/scylladb#26122 * github.com:scylladb/scylladb: mv: fix typo in start_backgroud_fibers mv: run view building worker fibers in streaming group	2025-09-22 15:28:38 +03:00
Pavel Emelyanov	ce8dd798a2	Merge 'tools/scylla-sstable-scripts: introduce purgeable.lua and writetime-histogram.lua' from Botond Dénes `purgeable.lua` was written for a specific investigation a few years ago. `writetime-histogram.lua` is an sstable script transcription of the former scylla-sstable writetime-histogram command. This was also written for an investigation (before script command existed) and is too specific to be a native command, so was removed by `edaf67edcb`. Add both scripts to the sample script library, they can be useful, either for a future investigation, or as samples to copy+edit to write new scripts (and train AI). New sstable scripts, no backport Closes scylladb/scylladb#26137 * github.com:scylladb/scylladb: tools/scylla-sstable-scripts: introduce writetime-histogram.lua tools/scylla-sstable-scripts: introduce purgable.lua	2025-09-22 15:27:49 +03:00
Avi Kivity	29032213c8	test: avoid #include <boost/test/included/...> The boost/test/included/... directory is apparently internal and not intended for user consumption. Including it caused a One-Definition-Rule violation, due to boost/test/impl/unit_test_parameters.ipp containing code like this: ```c++ namespace runtime_config { // UTF parameters std::string btrt_auto_start_dbg = "auto_start_dbg"; std::string btrt_break_exec_path = "break_exec_path"; std::string btrt_build_info = "build_info"; std::string btrt_catch_sys_errors = "catch_system_errors"; std::string btrt_color_output = "color_output"; std::string btrt_detect_fp_except = "detect_fp_exceptions"; std::string btrt_detect_mem_leaks = "detect_memory_leaks"; std::string btrt_list_content = "list_content"; ``` This is defining variables in a header, and so can (and in fact does) create duplicate variable definitions, which later cause trouble. So far, we were protected from this trouble by -fvisibility=hidden, which caused the duplicate definitions to be in fact not duplicate. Fix this by correcting the include path away from <boost/test/included/>. Closes scylladb/scylladb#26161	2025-09-22 15:26:06 +03:00
Wojciech Mitros	d9b8278178	mv: handle mismatched base/view replica count caused by RF change During an ALTER KEYSPACE statement execution where a table with a view is present, we need to perform tablet migrations for both tables. These migrations are not synchronized, so at some point the base may have a different number of non-pending replicas than the view. Because of that, we can't pair them correctly. If there are more non-pending base replicas than view replicas, we don't need to do anything because the view replica that didn't finish migrating is a pending replica and will get view updates from all base replicas. But if there are more non-pending view replicas than base replicas, we may currently lose view updates to the new view replica. This patch adds a workaround for this scenario. If after one migration we have more non-pending view replicas than base replicas, we add the extra view replica to the pending replica list so that it gets an update anyway. This patch will also take effect if the base and view replica counts differ due to some other bug. To track that, a new metric is added to count such occurrences. This patch also includes a test for this exact scenario, which is enforced by an injection. Fixes https://github.com/scylladb/scylladb/issues/21492	2025-09-22 12:50:16 +02:00
Wojciech Mitros	59c40a2edd	mv: save the nodes used for pairing calculations for later reuse In get_view_natural_endpoint() we start with the list of host_ids from the effective replication maps, which we later translate to locator::node to get the information about racks and datacenters. We check all replicas, but we only store the ones relevant for pairing, so for tablets, the ones in the same DC as the replica sending the update. In the next patch, we'll occasionally need to send cross-dc view updates, so to avoid computing the nodes again, in this patch we adjust the logic to prepare them in advance and save them so that they can be later reused.	2025-09-22 12:45:24 +02:00
Wojciech Mitros	9d4449a492	mv: move the decision about simple rack-aware pairing later We'll need to get the lists for the whole dc when fixing replica count mismatches caused by RF changes, so let's first get these lists, and only filter them later if we decide to use simple rack-aware pairing.	2025-09-22 12:45:24 +02:00
Anna Stuchlik	b18b052d26	doc: remove n1-highmem instances from Recommended Instances	2025-09-22 12:40:36 +02:00
Nadav Har'El	b205e1a3da	Merge 'vector_store_client: Extract DNS logic into a dedicated class' from Karol Nowacki Vector search related implementation moved to a new module vector_search. As the vector search functionality is going to be extended, it is better to keep it in a separate module. The DNS resolution logic and its background task are moved out of the `vector_store_client` and into a new, dedicated class `vector_search::dns`. This refactoring is the first step towards supporting DNS hostnames that resolve to multiple IP addresses. References: VECTOR-187 No backport needed as this is refactoring. Closes scylladb/scylladb#26052 * github.com:scylladb/scylladb: vector_store_client_test: Verify DNS is not refreshed when disabled vector_store_client: Extract DNS logic into a dedicated class vector_search: Apply clang-format vector_store_client: Move to vector_search module	2025-09-22 13:24:34 +03:00
Michael Litvak	beb11760e0	test/cqlpy/test_permissions: unskip test for tablets the test was skipped for tablets because CDC wasn't supported with tablets, but now it is supported and the issue is closed, so the test should be unskipped.	2025-09-22 10:03:32 +02:00
Michael Litvak	65351fda29	alternator: update references to alternator streams issue update all the references about the issue of tablets support for alternator streams to issue #23838 instead of #16317. The issue #16317 is about support of CDC with tablets, but it is now closed and it didn't address alternator streams. the remaining issues about alternator streams should be addressed as part of #23838, so fix the references in order for them not to be missed.	2025-09-22 09:56:23 +02:00
Avi Kivity	1258e7c165	Revert "Merge 'transport: service_level_controller: create and use `driver` service level' from Andrzej Jackowski" This reverts commit `fe7e63f109`, reversing changes made to `b5f3f2f4c5`. It is causing test.py failures around cqlpy. Fixes #26163 Closes scylladb/scylladb#26174	2025-09-22 09:32:46 +03:00
Piotr Dulikowski	b382531d99	Merge 'cdc: fix create table with cdc if not exists' from Michael Litvak Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with CDC enabled fails with an error if the table already exists. Instead, the query should succeed and be a no-op. This regression was introduced by commit `fed1048059`. Previously, when executing the query, we would first check if the table exists in do_prepare_new_column_families_announcement. If it did, we would throw an already_exists_exception, which was handled correctly; otherwise, we would continue and create the CDC table in the before_create_column_families notification. The order of operations was changed in `fed1048059`, causing the regression. Now, we first create the CDC schema and add it to the schema list for creation, and then check for each of them if they already exist. The problem is that when we create the CDC schema in on_pre_create_column_families, it also checks if the CDC table already exists. If it does, it throws an invalid_request_exception, which is not caught and handled as expected. This patch restores the previous order of operations: we first check if the tables exist, and only then add the CDC schema in pre_create. Fixes https://github.com/scylladb/scylladb/issues/26142 no backport - recent regression, not released yet Closes scylladb/scylladb#26155 * github.com:scylladb/scylladb: test: add test for creating table with CDC enabled if not exists cdc: fix create table with cdc if not exists	2025-09-22 08:18:26 +02:00
Piotr Dulikowski	591a67c7e7	Merge 'view_builder: register view on all shards atomically' from Michael Litvak When the view builder starts to build a new view, each shard registers itself by writing the shard id and current token to the scylla_views_builds_in_progress table. Previously, this happened independently by each shard. We change it now to register all shards "atomically" - when a shard registers itself, it also registers all other shards with an empty status, if they aren't registered yet. This ensures that we don't have a partial state in the table where only some of the shards are registered, but we always have a status for all shards. The reason we want to register all shards atomically is that if it happens that only some of the shards were registered, then we restart and load the status from table, this doesn't work well for multiple reasons. One example is that to know how many shards we had previously, we take the maximum shard id we see in the table. If it's different than the current shard count, we will execute the reshard code. But of course, if the last shard is missing from the table because it didn't register itself, this calculation will be wrong, and we can't know the previous number of shards. This is a problem because suppose we have two shards, and shard 0 finished building the view but shard 1 didn't start. When we come up, we will think that previously we had only a single shard and it completed building everything, when in fact we built only half the view approximately. The problem is that we don't have enough information in the tables to know that. There are additional problems related to reshard. In the reshard function, whether it is executed because we actually do node reshard or because we calculated the wrong number of previous shards, if the status of some shard is missing then the calculation of new ranges will be wrong. When some shard didn't make progress we should start building the view from scratch. 
However, this doesn't happen if we don't have a status for the shard, because the code looks only for shards that have a status. In effect, this shard is considered complete even though it didn't start. This could cause the view building to get stuck or complete without building all token ranges. Registering all shards atomically should solve the above problems, because we will always have statuses for all shards. Fixes https://github.com/scylladb/scylladb/issues/22989 backport not needed - the issue is probably not common and there's a workaround Closes scylladb/scylladb#25790 * github.com:scylladb/scylladb: test: mv: add a test for view build interrupt during registration view_builder: register view on all shards atomically	2025-09-22 08:03:44 +02:00
Karol Nowacki	6bd1d7db49	vector_store_client_test: Verify DNS is not refreshed when disabled Extend the `vector_store_client_uri_update_to_empty` test case to verify that the DNS resolver stops refreshing when the vector store is disabled.	2025-09-22 08:02:59 +02:00
Karol Nowacki	27219b8b7c	vector_store_client: Extract DNS logic into a dedicated class The DNS resolution logic and its background task are moved out of the `vector_store_client` and into a new, dedicated class `vector_search::dns`. This refactoring is the first step towards supporting DNS hostnames that resolve to multiple IP addresses. Signed-off-by: Karol Nowacki <karol.nowacki@scylladb.com>	2025-09-22 08:01:53 +02:00
Karol Nowacki	7cc7b95681	vector_search: Apply clang-format Run clang-format on the vector_search module to fix minor formatting inconsistencies.	2025-09-22 08:01:50 +02:00
Karol Nowacki	eae71d3e91	vector_store_client: Move to vector_search module Vector search related implementation moved to a new module vector_search. As the vector search functionality is going to be extended, it is better to keep it in a separate module.	2025-09-22 08:01:47 +02:00
Ferenc Szili	d9f272dbdd	load_balancer: fix badness object creation The load balancer introduced the idea of badness, which is a measure of how a tablet migration affects table balance on the source and destination. This is an abbreviated definition of the badness struct: struct migration_badness { double src_shard_badness = 0; double src_node_badness = 0; double dst_shard_badness = 0; double dst_node_badness = 0; ... double node_badness() const { return std::max(src_node_badness, dst_node_badness); } double shard_badness() const { return std::max(src_shard_badness, dst_shard_badness); } }; A negative value for any of these four members signifies a good migration (improves table balance), and a positive value signifies a bad migration. In two places in the balancer, badness for source and destination is computed independently in two objects of type migration_badness (src_badness and dst_badness), and later combined into a single object similar to this: return migration_badness{ src_badness.shard_badness(), src_badness.node_badness(), dst_badness.shard_badness(), dst_badness.node_badness() }; This is a problem when, for instance, source shard badness is good (less than 0): shard_badness() will return 0 because of std::max(). This way the actual computed badness is not set in the final object. This can lead to incorrect decisions made later by the balancer, when it searches for the best migration among a set of candidates. Closes scylladb/scylladb#26091	2025-09-21 21:37:23 +02:00
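The clamping effect described in the load_balancer commit above can be reproduced in isolation. The following is a minimal sketch built only from the abbreviated struct quoted in the commit message; `combine_via_accessors`, `combine_raw_members`, and `demo_badness` are hypothetical names, and copying the raw members is an assumption used for contrast, not necessarily the change the commit made.

```cpp
#include <algorithm>
#include <cassert>

// Abbreviated struct, as quoted in the commit message.
struct migration_badness {
    double src_shard_badness = 0;
    double src_node_badness = 0;
    double dst_shard_badness = 0;
    double dst_node_badness = 0;

    double node_badness() const { return std::max(src_node_badness, dst_node_badness); }
    double shard_badness() const { return std::max(src_shard_badness, dst_shard_badness); }
};

// The pattern described as problematic: each accessor takes std::max() with
// the other side's member, which is still 0 in a source-only or
// destination-only object, so negative (good) values are clamped to 0.
migration_badness combine_via_accessors(const migration_badness& src, const migration_badness& dst) {
    return migration_badness{src.shard_badness(), src.node_badness(),
                             dst.shard_badness(), dst.node_badness()};
}

// Hypothetical alternative (an assumption for illustration): copy the raw
// members so that negative values survive the combination.
migration_badness combine_raw_members(const migration_badness& src, const migration_badness& dst) {
    return migration_badness{src.src_shard_badness, src.src_node_badness,
                             dst.dst_shard_badness, dst.dst_node_badness};
}

void demo_badness() {
    migration_badness src;
    src.src_shard_badness = -0.5;   // good for the source shard
    src.src_node_badness = -0.25;   // good for the source node
    migration_badness dst;
    dst.dst_shard_badness = 0.125;  // slightly bad for the destination shard
    dst.dst_node_badness = 0.5;     // bad for the destination node

    migration_badness clamped = combine_via_accessors(src, dst);
    assert(clamped.src_shard_badness == 0);      // the -0.5 was lost
    assert(clamped.src_node_badness == 0);       // the -0.25 was lost

    migration_badness kept = combine_raw_members(src, dst);
    assert(kept.src_shard_badness == -0.5);      // preserved
    assert(kept.dst_node_badness == 0.5);        // positive values are unaffected
}
```

Only the negative source values are lost in this example; positive values pass through `std::max()` unchanged, which is why the problem shows up specifically for good migrations.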
Dawid Mędrek	0d2560c07f	test/perf/tablet_load_balancing.cc: Create nodes within one DC In `789a4a1ce7`, we adjusted the test file to work with the configuration option `rf_rack_valid_keyspaces`. Part of the commit was making the two tables used in the test replicate in separate data centers. Unfortunately, that destroyed the point of the test because the tables no longer competed for resources. We fix that by enforcing the same replication factor for both tables. We still accept different values of replication factor when provided manually by the user (by the `--rf1` and `--rf2` command-line options). Scylla won't allow creating RF-rack-invalid keyspaces, but there's no reason to take away the flexibility the user of the test already has. Fixes scylladb/scylladb#26026 Closes scylladb/scylladb#26115	2025-09-21 21:36:43 +02:00
Tomasz Grabiec	ddbcea3e2a	tablets: scheduler: Run plan-maker in maintenance scheduling group Currently, it runs in the gossiper scheduling group, because it's invoked by the topology coordinator. That scheduling group has the same amount of shares as user workload. Plan-making can take a significant amount of time during rebalancing, and we don't want that to impact user workload which happens to run on the same shard. Reduce the impact by running in the maintenance scheduling group. Fixes #26037 Closes scylladb/scylladb#26046	2025-09-21 18:44:57 +03:00
Tomasz Grabiec	4a83b4eef3	Merge 'topology_coordinator: abort view building a bit later in case of tablet migration' from Piotr Dulikowski In a multi-DC setup, the tablet load balancer might generate multiple migrations of the same tablet_id, but only one is actually committed to the `system.tablets` table. This PR moves the abortion of view building tasks from the very start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we'll abort only tasks for which the tablet migration was actually started. The PR also includes a reproducer test. Fixes scylladb/scylladb#25912 View building coordinator hasn't been released yet, so no backport is needed. Closes scylladb/scylladb#26144 * github.com:scylladb/scylladb: test/test_view_building_coordinator: add reproducer topology_coordinator: abort view building a bit later in case of tablet migration	2025-09-21 15:41:53 +02:00
Karol Nowacki	eedf506be5	vector_store_client: Rename vector_store_uri to vector_store_primary_uri The configuration setting vector_store_uri is renamed to vector_store_primary_uri according to the final design. In the future, the vector_store_secondary_uri setting will be introduced. This setting now also accepts a comma-separated list of URIs to prepare for future support for redundancy and load balancing. Currently, only the first URI in the list is used. This change must be included before the next release. Otherwise, users will be affected by a breaking change. References: VECTOR-187 Closes scylladb/scylladb#26033	2025-09-21 16:33:10 +03:00
Michael Litvak	3dffb8e0dc	test: mv: add a test for view build interrupt during registration Add a new test that reproduces issue #22989. The test starts view building and interrupts it by restarting the node while some shards registered their status and some didn't.	2025-09-21 10:39:30 +02:00
Michael Litvak	6043409c31	view_builder: register view on all shards atomically When the view builder starts to build a new view, each shard registers itself by writing the shard id and current token to the scylla_views_builds_in_progress table. Previously, this happened independently by each shard. We change it now to register all shards "atomically" - when a shard registers itself, it also registers all other shards with an empty status, if they aren't registered yet. This ensures that we don't have a partial state in the table where only some of the shards are registered, but we always have a status for all shards. The reason we want to register all shards atomically is that if it happens that only some of the shards were registered, then we restart and load the status from table, this doesn't work well for multiple reasons. One example is that to know how many shards we had previously, we take the maximum shard id we see in the table. If it's different than the current shard count, we will execute the reshard code. But of course, if the last shard is missing from the table because it didn't register itself, this calculation will be wrong, and we can't know the previous number of shards. This is a problem because suppose we have two shards, and shard 0 finished building the view but shard 1 didn't start. When we come up, we will think that previously we had only a single shard and it completed building everything, when in fact we built only half the view approximately. The problem is that we don't have enough information in the tables to know that. There are additional problems related to reshard. In the reshard function, whether it is executed because we actually do node reshard or because we calculated the wrong number of previous shards, if the status of some shard is missing then the calculation of new ranges will be wrong. When some shard didn't make progress we should start building the view from scratch. 
However, this doesn't happen if we don't have a status for the shard, because the code looks only for shards that have a status. In effect, this shard is considered complete even though it didn't start. This could cause the view building to get stuck or complete without building all token ranges. Registering all shards atomically should solve the above problems, because we will always have statuses for all shards. Fixes scylladb/scylladb#22989	2025-09-21 10:39:05 +02:00
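The shard-count inference problem described in the view_builder commit above can be illustrated with a small sketch. It assumes that the previous shard count is derived as the largest registered shard id plus one; `infer_previous_shard_count` and `demo_shard_count` are hypothetical names, and the vectors stand in for the shard ids found in the scylla_views_builds_in_progress table.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: infer how many shards the node had before a restart
// from the shard ids that have a registered status. Assumes ids are 0-based,
// so the count is (max id + 1); returns 0 when nothing is registered.
unsigned infer_previous_shard_count(const std::vector<unsigned>& registered_shard_ids) {
    if (registered_shard_ids.empty()) {
        return 0;
    }
    return *std::max_element(registered_shard_ids.begin(), registered_shard_ids.end()) + 1;
}

void demo_shard_count() {
    // Two-shard node where only shard 0 registered before the restart:
    // the inferred count is 1, so the unbuilt half of the view goes unnoticed.
    assert(infer_previous_shard_count({0}) == 1);

    // With atomic registration, shard 1 is present with an empty status,
    // so the inferred count matches the real one.
    assert(infer_previous_shard_count({0, 1}) == 2);
}
```

The second case is the invariant the commit aims for: every shard always has a row, so the maximum id reliably reflects the previous shard count.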
Evgeniy Naydanov	85cbe7a8d4	test: add test for creating table with CDC enabled if not exists Check if there are no errors on the second attempt of executing "create table if not exists" query if CDC is enabled.	2025-09-21 09:38:36 +02:00
Michael Litvak	5a7e6e53ff	cdc: fix create table with cdc if not exists Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with CDC enabled fails with an error if the table already exists. Instead, the query should succeed and be a no-op. This regression was introduced by commit `fed1048059`. Previously, when executing the query, we would first check if the table exists in do_prepare_new_column_families_announcement. If it did, we would throw an already_exists_exception, which was handled correctly; otherwise, we would continue and create the CDC table in the before_create_column_families notification. The order of operations was changed in `fed1048059`, causing the regression. Now, we first create the CDC schema and add it to the schema list for creation, and then check for each of them if they already exist. The problem is that when we create the CDC schema in on_pre_create_column_families, it also checks if the CDC table already exists. If it does, it throws an invalid_request_exception, which is not caught and handled as expected. This patch restores the previous order of operations: we first check if the tables exist, and only then add the CDC schema in pre_create. Fixes scylladb/scylladb#26142	2025-09-21 09:38:36 +02:00
Michał Hudobski	1690e5265a	vector search: correct column name formatting This patch corrects the column name formatting whenever an "Undefined column name" exception is thrown. Until now we used the `name()` function which returns a bytes object. This resulted in a message with a garbled ascii bytes column name instead of a proper string. We switch to the `text()` function that returns a sstring instead, making the message readable. Tests are adjusted to confirm this behavior. Fixes: VECTOR-228 Closes scylladb/scylladb#26120	2025-09-20 07:02:53 +02:00
Michał Jadwiszczak	2aabf8ee3f	test/test_view_building_coordinator: add reproducer Adds a test which reproduces the issue described in scylladb/scylladb#25912. The test creates a situation where a single tablet is replicated across multiple DCs / racks, and all those tablet replicas are eligible for migration. The tablet load balancer is unpaused at that moment which currently causes it to attempt to generate multiple migrations for different tablet replicas of the same tablet. Before the fix for #25912, this used to confuse the view build coordinator which would react to each migration attempt, pausing view building work for each tablet replica for which there was an attempt to migrate but only unpausing it for the tablet replica that was actually migrated. After the fix, the view build coordinator only reacts to the migration that has "won" so the test successfully passes.	2025-09-19 19:08:34 +02:00
Michał Jadwiszczak	50c5354d0b	topology_coordinator: abort view building a bit later in case of tablet migration In a multi-DC setup, the tablet load balancer might generate multiple migrations of the same tablet_id, but only one is actually committed to the `system.tablets` table. This patch moves the abortion of view building tasks from the very start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we'll abort only tasks for which the tablet migration was actually started. Fixes scylladb/scylladb#25912	2025-09-19 18:02:41 +02:00
Michał Chojnowski	9e70df83ab	db: get rid of sstables-format-selector Our sstable format selection logic is weird, and hard to follow. If I'm not misunderstanding, the pieces are: 1. There's the `sstable_format` config entry, which currently doesn't do anything, but in the past it used to disable cluster features for versions newer than the specified one. 2. There are deprecated and unused config entries for individual versions (`enable_sstables_mc_format`, `enable_sstables_md_format`, etc). 3. There is a cluster feature for each version: ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc. (Currently all sstable version features have been grandfathered, and aren't checked by the code anymore). 4. There's an entry in `system.scylla_local` which contains the latest enabled sstable version. (Why? Isn't this directly derived from cluster features anyway?) 5. There's `sstable_manager::_format` which contains the sstable version to be used for new writes. This field is updated by `sstables_format_selector` based on cluster features and the `system.scylla_local` entry. I don't see why those pieces are needed. Version selection has the following constraints: 1. New sstables must be written with a format that supports existing data. For example, range tombstones with an infinite bound are only supported by sstables since version "mc". So if a range tombstone with an infinite bound exists somewhere in the dataset, the format chosen for new sstables has to be at least as new as "mc". 2. A new format might only be used after a corresponding cluster feature is enabled. (Otherwise new sstables might become unreadable if they are sent to another node, or if a node is downgraded). 3. The user should have a way to inhibit format upgrades if they wish. So far, constraint (1) has been fulfilled by never using formats older than the newest format ever enabled on the node. (With an exception for resharding and reshaping system tables). 
Constraint (2) has been fulfilled by calling `sstable_manager::set_format` only after the corresponding cluster feature is enabled. Constraint (3) has been fulfilled by the ability to inhibit cluster features by setting `sstable_format` to some fixed value. The main thing I don't like about this whole setup is that it doesn't let me downgrade the preferred sstable format. After a format is enabled, there is no way to go back to writing the old format again. That is no good -- after I make some performance-sensitive changes in a new format, it might turn out to be a pessimization for the particular workload, and I want to be able to go back. This patch aims to give a way to downgrade formats without violating the constraints. What it does is: 1. The entry in `system.scylla_local` becomes obsolete. After the patch we no longer update or read it. As far as I understand, the purpose of this entry is to prevent unwanted format downgrades (which is something cluster features are designed for) and it's updated if and only if relevant cluster features are updated. So there's no reason to have it, we can just directly use cluster features. 2. `sstable_format_selector` gets deleted. Without the `system.scylla_local` entry around, it's just a glorified feature listener. 3. The format selection logic is moved into `sstable_manager`. It already sees the `db::config` and the `gms::feature_service`. For the foreseeable future, the knowledge of enabled cluster features and current config should be enough information to pick the right formats. 4. The `sstable_format` entry in `db::config` is no longer intended to inhibit cluster features. Instead, it is intended to select the format for new sstables, and it becomes live-updatable. 5. 
Instead of writing new sstables with "highest supported" format, (which used to be set by `sstables_format_selector`) we write them with the "preferred" format, which is determined by `sstable_manager` based on the combination of enabled features and the current value of `sstable_format`. Closes scylladb/scylladb#26092 [avi: Pavel found the reason for the scylla_local entry - it predates stable storage for cluster features]	2025-09-19 16:17:56 +03:00
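One plausible reading of the "preferred" format rule in the commit above is that the configured `sstable_format` acts as an upper bound, and the cluster features act as a second upper bound. The sketch below is an assumption for illustration, not the logic in `sstable_manager`; the enum, `preferred_format`, and `demo_preferred_format` are hypothetical names, and constraint (1) about existing data is not modelled.

```cpp
#include <algorithm>
#include <cassert>

// Format names taken from the commit message, ordered from oldest to newest.
enum class sstable_format { mc, md, me };

// Hypothetical helper: pick the format for new sstables as the older of
// (a) the format requested in the config and
// (b) the newest format whose cluster feature is enabled.
// This keeps constraint (2) (never exceed enabled features) and
// constraint (3) (the user can hold the format back, or lower it later).
sstable_format preferred_format(sstable_format configured, sstable_format highest_enabled_by_features) {
    return std::min(configured, highest_enabled_by_features);
}

void demo_preferred_format() {
    // The user asks for an older format than the cluster supports: honour it.
    assert(preferred_format(sstable_format::md, sstable_format::me) == sstable_format::md);

    // The user asks for a newer format than the cluster supports: cap it.
    assert(preferred_format(sstable_format::me, sstable_format::md) == sstable_format::md);

    // Config and features agree.
    assert(preferred_format(sstable_format::me, sstable_format::me) == sstable_format::me);
}
```

Because the config value is described as live-updatable, a rule of this shape would let an operator move from "me" back to "md" by changing the setting, which is the downgrade path the commit message asks for.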
Pavel Emelyanov	d1626dfa86	api: Move /storage_service/compact to tasks.cc This one doesn't have async peer there, but it's still a pure compaction manager endpoint handler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:59 +03:00
Pavel Emelyanov	6eaa2138ad	api: Move /storage_service/keyspace_upgrade_sstables to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:54 +03:00
Pavel Emelyanov	fe2a184713	api: Move /storage_service/keyspace_offstrategy_compaction to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:49 +03:00
Pavel Emelyanov	607a39acbd	api: Move /storage_service/keyspace_cleanup to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:44 +03:00
Pavel Emelyanov	abd23bdd6d	api: Move /storage_service/keyspace_compaction to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:37 +03:00
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as a compatibility alias. Latest seastar explicitly marks it as deprecated so once the submodule is updated, compilation logs will explode. Most of the patch is generated with for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done and a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
Aleksandra Martyniuk	5235e3cf67	test: limit test_streaming_deadlock_removenode concurrency test_streaming_deadlock_removenode starts 10240 inserts at once, overloading a node. As a result, the test fails with a timeout. Limit the concurrency of inserts. Fixes: #25945. Closes scylladb/scylladb#26102	2025-09-19 12:50:20 +03:00
Botond Dénes	92f614dc5a	tools/scylla-sstable-scripts: introduce writetime-histogram.lua Produces a histogram with the writetime (timestamp) of the data in the sstable(s). The histogram is printed to the output, along with general stats about the processed data.	2025-09-19 11:54:01 +03:00
Botond Dénes	d298da5410	tools/scylla-sstable-scripts: introduce purgable.lua Collects and prints statistics about how much data is purgeable in an sstable. Works only with tombstone_gc = {'mode': 'timeout'}. Can help diagnose the efficiency (or lack thereof) of tombstone-gc.	2025-09-19 11:53:30 +03:00
Ernest Zaslavsky	e56081d588	treewide: seastar module update and fix broken rest client start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client Seastar module update ``` b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El 74472298 http: generalize request's Content-Type setting 9fd5a1cc http: generalize reply's Content-Type setting a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure() d2a5a8a9 resource.cc: Remove some dead code 7ad9f424 http: Add support of multiple key repetitions for the request a636baca task: Move task::get_backtrace() definition in its class a0101efa Fixed "doxygen" spelling in error message db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes 5357b434 http/reply: introduce set_cookie() 1ddcf05f http/reply: make write_reply*() public 4b782d73 http/connection: start_response(): fix indentation 720feca0 http/reply: encapsulate reply writing in write_reply() 3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov dbb2bf3f test: Add test for input_stream::read_exactly() a5308ec9 file/directory_lister: Correctly wrap up fallback generator 4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer 59801da7 tests: Add directory lister early drop cases 33233032 http/reply: s/write_reply_to_connection/write_reply/ 69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream 56e9bda7 test: Convert directory_test into seastar test 96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov 8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream 3370e22a tutorial.md: use current_exception_as_future() e977453a Add fixture support for seastar::testing 3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally 2a4ae7b4 io_tester: Count 
file size overflows 5e678bb5 io_tester: Tuneup size overflow check d5dad8ce io_tester: Move position management code to io_class_data 5586a056 io_tester: Rename seqwrite -> overwrite 92df2fb2 io_tester: Relax return value of create_and_fill_file() 03d9500d io_tester: Dont fill file for APPEND d6844a7b io_tester: Indentation fix after previous patch fb9e0088 io_tester: Coroutinize create_and_fill_file() 2f802f57 exceptions: log thrown and propagated exception with distinct log levels 4971fa70 util: move log-level into own header 39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov 52d0c4fb iostream: Move output_stream::write(scattered_message) lower 7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky d0881b7e read_first_line: Add missing license boilerplate 988a0e99 read_first_line:: Add missing `#pragma once` 42675266 http: Make client::make_request accept const request& c7709fb5 http: Make request making API return exceptional future not throw b68ed89b http: Move request content length header setup 1d96dac6 http: Move request version configuration 072e86f6 http: Setup request once ``` Closes scylladb/scylladb#25915 (cherry picked from commit `44d34663bc`) Closes scylladb/scylladb#26100	2025-09-19 11:40:59 +03:00
Botond Dénes	37e46f674d	Merge 'nodetool: ignore repair request error of colocated tables' from Michael Litvak When cluster repair is run for an entire keyspace, nodetool makes a repair API request for each table. If the keyspace contains colocated tables, then the API request for the colocated tables will fail, because currently Scylla doesn't allow making repair requests for specific colocated tables, but only for base tables. If the request is to repair an entire keyspace then we can ignore this, because we will make a repair request for all base tables, and this in turn will also repair all the colocated tables in the keyspace. However, if specific tables are requested and some of them are colocated, then we should propagate the error to let the user know the request is invalid. Refs https://github.com/scylladb/scylladb/issues/24816 no backport - no colocated tablets in previous releases Closes scylladb/scylladb#26051 * github.com:scylladb/scylladb: nodetool: ignore repair request error of colocated tables storage_service: improve error message on repair of colocated tables	2025-09-19 06:44:23 +03:00
Nadav Har'El	7be5454db1	Merge 'alternator: Store LSI keys in :attrs for newly created tables' from Piotr Wieczorek Previously, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, we use top-level computed columns instead of regular ones. This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all attributes in a row except for keys in new tables. Refs https://github.com/scylladb/scylladb/pull/24991 Refs https://github.com/scylladb/scylladb/issues/6930 Closes scylladb/scylladb#25796 * github.com:scylladb/scylladb: alternator: Store LSI keys in :attrs for newly created tables alternator/test: Add LSI tests based mostly on the existing GSI tests	2025-09-18 21:48:43 +03:00
Karol Nowacki	bc06f89a5c	vector_store_client: Fix cleanup of client_producer factory vector_store_client::stop did not properly clean up the coroutine that was waiting for a notification on the refresh_client_cv condition variable. As a result, the coroutine could try to access `this` (via current_client) after the vector_store_client was destroyed. To fix this, the `client_producer` tasks are wrapped by a gateway. The `stop` method now signals the `client_producer` condition variable and closes the gateway, which ensures that all `client_producer` tasks are finished before the `stop` function returns. The `wait_for_signal` return type was changed from `bool` to `void` as the return value was not used. Fixes: VECTOR-230 Closes scylladb/scylladb#26076	2025-09-18 21:34:34 +03:00
Avi Kivity	fe7e63f109	Merge 'transport: service_level_controller: create and use `driver` service level' from Andrzej Jackowski This patch series: - Increases the number of allowed scheduling groups to allow creation of `sl:driver` - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader` - Modifies `topology_coordinator` to create `sl:driver` after upgrades. - Implements using `sl:driver` for new connections in `transport/server` - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`. - Adds tests to verify the new functionality - Modifies existing tests to let them pass after `sl:driver` is added - Modifies the documentation to contain new `sl:driver` The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)): - Start ScyllaDB with one node - Create 1000 keyspaces, 1 table in each keyspace - Start `cassandra-stress` (`-rate threads=50 -mode native cql3`) - Run connection storm with 1000 sessions (100 python processes, 10 sessions each) The maximum latency during the connection storm dropped from 224.94ms to 41.43ms (those numbers are averages from 20 test executions, where max latency was in [140ms, 361ms] before the change and [31.4ms, 61.5ms] after). The snippet of cassandra-stress output from the moment of the connection storm: Before: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 789206, 85887, 85887, 85887, 0.6, 0.3, 2.0, 2.0, 2.5, 5.0, 9.0, 0.09679, 0, 0, 0, 0, 0, 0 total, 909322, 120116, 120116, 120116, 0.4, 0.2, 1.9, 2.0, 2.1, 3.1, 10.0, 0.09053, 0, 0, 0, 0, 0, 0 total, 964392, 55070, 55070, 55070, 0.9, 0.4, 2.0, 4.5, 7.7, 18.9, 11.0, 0.09203, 0, 0, 0, 0, 0, 0 total, 975705, 11313, 11313, 11313, 4.4, 3.5, 6.5, 24.5, 82.7, 83.0, 12.0, 0.11713, 0, 0, 0, 0, 0, 0 total, 987548, 11843, 11843, 11843, 4.2, 3.5, 6.5, 33.7, 48.6, 51.5, 13.0, 0.13366, 0, 0, 0, 0, 0, 0 total, 995422, 7874, 7874, 7874, 6.3, 4.0, 7.7, 85.6, 112.9, 113.5, 14.0, 0.14753, 0, 0, 0, 0, 0, 0 total, 1007228, 11806, 11806, 11806, 4.3, 3.5, 6.5, 29.1, 43.8, 87.1, 15.0, 0.15598, 0, 0, 0, 0, 0, 0 total, 1012840, 5612, 5612, 5612, 8.2, 5.0, 11.5, 121.8, 166.6, 170.1, 16.0, 0.16535, 0, 0, 0, 0, 0, 0 total, 1016186, 3346, 3346, 3346, 13.4, 7.4, 20.1, 204.9, 207.6, 210.4, 17.0, 0.17405, 0, 0, 0, 0, 0, 0 total, 1025462, 9276, 9276, 9276, 6.3, 3.9, 9.6, 74.6, 206.8, 210.0, 18.0, 0.17800, 0, 0, 0, 0, 0, 0 total, 1035979, 10517, 10517, 10517, 4.8, 3.5, 6.7, 38.5, 82.6, 83.0, 19.0, 0.18120, 0, 0, 0, 0, 0, 0 total, 1047488, 11509, 11509, 11509, 4.3, 3.5, 6.0, 32.6, 72.3, 74.0, 20.0, 0.18334, 0, 0, 0, 0, 0, 0 total, 1077456, 29968, 29968, 29968, 1.7, 1.6, 2.9, 3.6, 7.0, 8.2, 21.0, 0.17943, 0, 0, 0, 0, 0, 0 total, 1105490, 28034, 28034, 28034, 1.8, 1.8, 3.5, 4.6, 5.3, 13.8, 22.0, 0.17609, 0, 0, 0, 0, 0, 0 total, 1132221, 26731, 26731, 26731, 1.9, 1.8, 3.8, 5.2, 8.4, 11.1, 23.0, 0.17314, 0, 0, 0, 0, 0, 0 total, 1162149, 29928, 29928, 29928, 1.7, 1.7, 3.0, 4.5, 8.0, 9.1, 24.0, 0.16950, 0, 0, 0, 0, 0, 0 ... ``` After: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 822863, 94379, 94379, 94379, 0.5, 0.3, 2.0, 2.0, 2.1, 3.7, 9.0, 0.06669, 0, 0, 0, 0, 0, 0 total, 937337, 114474, 114474, 114474, 0.4, 0.2, 2.0, 2.0, 2.1, 3.4, 10.0, 0.06301, 0, 0, 0, 0, 0, 0 total, 986630, 49293, 49293, 49293, 1.0, 1.0, 2.0, 2.1, 17.9, 19.0, 11.0, 0.07318, 0, 0, 0, 0, 0, 0 total, 1026734, 40104, 40104, 40104, 1.2, 1.0, 2.0, 2.2, 6.3, 7.1, 12.0, 0.08410, 0, 0, 0, 0, 0, 0 total, 1066124, 39390, 39390, 39390, 1.3, 1.0, 2.0, 2.2, 2.6, 3.4, 13.0, 0.09108, 0, 0, 0, 0, 0, 0 total, 1103082, 36958, 36958, 36958, 1.3, 1.1, 2.1, 2.5, 3.1, 4.2, 14.0, 0.09643, 0, 0, 0, 0, 0, 0 total, 1141987, 38905, 38905, 38905, 1.3, 1.0, 2.0, 2.4, 11.4, 12.7, 15.0, 0.09894, 0, 0, 0, 0, 0, 0 total, 1180023, 38036, 38036, 38036, 1.3, 1.0, 2.0, 3.7, 5.6, 7.1, 16.0, 0.10070, 0, 0, 0, 0, 0, 0 total, 1216481, 36458, 36458, 36458, 1.4, 1.0, 2.1, 3.6, 4.7, 5.0, 17.0, 0.10210, 0, 0, 0, 0, 0, 0 total, 1256819, 40338, 40338, 40338, 1.2, 1.0, 2.0, 2.2, 3.5, 5.4, 18.0, 0.10173, 0, 0, 0, 0, 0, 0 total, 1295122, 38303, 38303, 38303, 1.3, 1.0, 2.0, 2.4, 21.0, 21.1, 19.0, 0.10136, 0, 0, 0, 0, 0, 0 total, 1334743, 39621, 39621, 39621, 1.3, 1.0, 2.0, 2.3, 3.3, 4.0, 20.0, 0.10055, 0, 0, 0, 0, 0, 0 total, 1375579, 40836, 40836, 40836, 1.2, 1.0, 2.0, 2.1, 3.4, 5.7, 21.0, 0.09927, 0, 0, 0, 0, 0, 0 total, 1415576, 39997, 39997, 39997, 1.2, 1.0, 2.0, 2.3, 3.2, 4.1, 22.0, 0.09807, 0, 0, 0, 0, 0, 0 total, 1449268, 33692, 33692, 33692, 1.5, 1.4, 2.5, 3.2, 4.2, 5.6, 23.0, 0.09800, 0, 0, 0, 0, 0, 0 total, 1471873, 22605, 22605, 22605, 2.2, 2.0, 4.8, 5.9, 7.0, 7.9, 24.0, 0.10015, 0, 0, 0, 0, 0, 0 ... ``` Fixes: https://github.com/scylladb/scylladb/issues/24411 This is a new feature, so no backport needed. 
Closes scylladb/scylladb#25412 * github.com:scylladb/scylladb: docs: workload-prioritization: add driver service level test: add test to verify use of `sl:driver` transport: use `sl:driver` to handle driver's control connections transport: whitespace only change in update_scheduling_group transport: call update_scheduling_group for non-auth connections generic_server: transport: start using `sl:driver` for new connections test: add test_desc_* for driver service level test: service_levels: add tests for sl:driver creation and removal test: add reload_raft_topology_state() to ScyllaRESTAPIClient service_level_controller: automatically create `sl:driver` service_level_controller: methods to create driver service level service_level_controller: handle special sl:driver in DESC output topology_coordinator: add service_level_controller reference system_keyspace: add service_level_driver_created test: add MAX_USER_SERVICE_LEVELS	2025-09-18 19:45:17 +03:00
Karol Nowacki	b5f3f2f4c5	tools: Fix missing source file in CMake target The `json_mutation_stream_parser.cc` file was not included in the `scylla-tools` CMake target. This could lead to "undefined reference" linker errors when building with CMake. This commit adds the missing source file to the target's source list. Closes scylladb/scylladb#26108	2025-09-18 19:44:53 +03:00
Radosław Cybulski	c0db278c03	Don't report spurious keys in DescribeTable When creating a GSI, Alternator artificially adds columns that the user did not ask for. This patch prevents those columns from showing up in DescribeTable's output. Fixes #5320 Closes scylladb/scylladb#25978	2025-09-18 19:34:39 +03:00
Radosław Cybulski	6240006c5a	Fix spelling errors Closes scylladb/scylladb#26112	2025-09-18 17:37:31 +02:00
Patryk Jędrzejczak	5efc46152c	Merge 'raft_topology: Modify the conditional logic in remove node operation …' from Abhinav Kumar Jha In the current scenario, the shard receiving the remove node REST api request performs a conditional lock depending on whether raft is enabled or not. Since a non-zero shard returns false for `raft_topology_change_enabled()`, the requests routed to non-zero shards are prone to this lock, which is unnecessary and hampers the ability to perform concurrent operations, which is possible for raft enabled nodes. This pr modifies the conditional lock logic and orchestrates the remove node execution logic directly to the shard0, hence the `raft_topology_change_enabled()` is now checked on the shard0 and execution is performed accordingly. Earlier, `storage_service::find_raft_nodes_from_hoeps` code threw error upon observing any non topology member present in ignore_nodes. Since we are performing concurrent remove node operations, the timing can lead to one node being fully removed before the other node remove op begins processing which can lead to runtime error in storage_service::find_raft_nodes_from_hoeps. This error throw was added to prevent users from putting random non existent nodes in ignore_nodes list. Hence made changes in that function to account for already removed nodes and ignore those nodes instead of throwing error. A test is also added to confirm the new behaviour, where concurrent remove node operations are now being performed seamlessly. This pr doesn't fix a critical bug. No need to backport it. Fixes: scylladb/scylladb#24737 Closes scylladb/scylladb#25713 * https://github.com/scylladb/scylladb: raft_topology: Modify the conditional logic in remove node operation to enhance concurrency for raft enabled clusters. storage_service: remove assumptions and checks for ignore_nodes to be normal.	2025-09-18 17:27:59 +02:00
Nadav Har'El	27c1545340	test/alternator: fix test_list_tables_paginated on DynamoDB Our list_tables() utility function, used by the test test_table.py::test_list_tables_paginated, asserts that empty pages cannot be returned by ListTables - and in fact neither DynamoDB nor Alternator returns them. But it turns out this is only true on DynamoDB's us-east-1 region, and in the eu-north-1 region, ListTables when using Limit=1 can actually return an empty last page. So let's just drop that unnecessary assertion as being wrong. In any case, this assert was in a utility function, not a test, which probably wasn't a great idea in the first place.	2025-09-18 17:46:34 +03:00
Piotr Dulikowski	fb0e5784e4	mv: fix typo in start_backgroud_fibers Letter "n" was missing in this name.	2025-09-18 15:50:16 +02:00
Piotr Dulikowski	261f61d303	mv: run view building worker fibers in streaming group The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong. Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.	2025-09-18 15:42:36 +02:00
Nadav Har'El	284284bf83	test/alternator: fix tests in test_tag.py on DynamoDB Until August 2024, DynamoDB's "TagResource" operation was synchronous - when it returned the tags were available for read. This is no longer true, as the new documentation says and we see in practice with many test_tag.py tests failing on DynamoDB. Not only can't we read the new tags without waiting, we're not allowed to change other tags or even to delete the table without waiting. We don't need to fix Alternator for this new behavior - there is (surprisingly!) no new API to check if the tag change took effect, and it's perfectly fine that in Alternator the tags take effect immediately (when TagResource returns) and not a few seconds later. But we do need to fix most test_tag.py tests to work with the new asynchronous API. The fix introduces convenience functions tag_resource() and untag_resource() which perform the TagResource or UntagResource operation, but also wait until the change took effect by trying ListTagsOfResources until the desired change is visible. This will make failed tests wait until the timeout (60 seconds), but that's fine - we don't expect to have failed tests. After this patch all tests in test/alternator/test_tag.py pass on DynamoDB (and continue passing on Alternator). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 16:38:09 +03:00
Łukasz Paszkowski	d42d4a05fb	disk_space_monitor_test.cc: Start a monitor after fake space source function is registered When the monitor is started, the first disk utilization value is obtained from the actual host filesystem and not from the fake space source function. Thus, register a fake space source function before the monitor is started. Fixes: https://github.com/scylladb/scylladb/issues/26036 Backport is not required. The test has been added recently. Closes scylladb/scylladb#26054	2025-09-18 15:03:34 +03:00
Piotr Dulikowski	5f55787e50	Merge 'CDC with tablets' from Michael Litvak initial implementation to support CDC in tablets-enabled keyspaces. The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge. Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future. In this PR we: * add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata * add virtual tables for CDC consumers * the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata * enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet. * on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet. 
* the cdc tablets are co-located with the base tablets Fixes https://github.com/scylladb/scylladb/issues/22576 backport not needed - new feature update dtests: https://github.com/scylladb/scylla-dtest/pull/5897 update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102 update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136 Closes scylladb/scylladb#23795 * github.com:scylladb/scylladb: docs/dev: update CDC dev docs for tablets doc: update CDC docs for tablets test: cluster_events: enable add_cdc and drop_cdc test/cql: enable cql cdc tests to run with tablets test: test_cdc_with_alter: adjust for cdc with tablets test/cqlpy: adjust cdc tests for tablets test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests cdc: enable cdc with tablets topology coordinator: change streams on tablet split/merge cdc: virtual tables for cdc with tablets cdc: generate_stream_diff helper function cdc: choose stream in tablets enabled keyspaces cdc: rename get_stream to get_vnode_stream cdc: load tablet streams metadata from tables cdc: helper functions for reading metadata from tables cdc: colocate cdc table with base cdc: remove streams when dropping CDC table cdc: create streams when allocating tablets migration_listener: add on_before_allocate_tablet_map notification cdc: notify when creating or dropping cdc table cdc: move cdc table creation to pre_create cdc: add internal tables for cdc with tablets cdc: add cdc_with_tablets feature flag cdc: add is_log_schema helper	2025-09-18 13:39:37 +02:00
Ernest Zaslavsky	d6aa04b88a	serialization: Eliminate `cql_serialization_format.hh` Eliminate `cql_serialization_format.hh` file by inlining it into `query-request.hh` header since the content is not used anywhere but the aforementioned header Removed files: - cql_serialization_format.hh Fixes: #22108 This is a cleanup, no need to backport Closes scylladb/scylladb#25087	2025-09-18 13:17:56 +03:00
Nadav Har'El	3afe078d24	test/alternator: fix test_health_only_works_for_root_path on DynamoDB test_health.py::test_health_only_works_for_root_path checks that while http://ourserver/ is a valid health-check URL, taking other silly strings at the end, like http://ourserver/abc - is NOT valid and results in an error. It turns out that for one of the silly strings we chose to test, "/health", DynamoDB started recently NOT to return an error, and instead return an empty but successful response. In fact, it does this for every string starting with /health - including "/healthz". Perhaps they did this for some sort of Kubernetes compatibility, but in any case this behavior isn't documented and we don't need to emulate it. For now, let's just remove the string "/health" from our test so the test doesn't fail on DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 12:48:42 +03:00
Nadav Har'El	5bd503ad43	test/alternator: reproducer tests for faux GSI range key problem In issue #5320 we noticed that when we have a GSI with a hash key only (no range key) but the implementation's MV needs to add a clustering key for the original base key columns, the output of DescribeTable wrongly lists that extra "range key" - which isn't a real range key of the GSI. It turns out that the fact that the extra attribute is not a real GSI range key has another implication: It should not be allowed in KeyConditions or KeyConditionExpression - which should allow only real key columns of the GSI. This patch adds two reproducing tests for this issue (issue #2601), both pass on DynamoDB but xfail on Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 12:17:21 +03:00
Avi Kivity	f6b6312cf4	Merge 'sstables/trie: prepare for integrating BTI indexes with sstable readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: introducing the new components, Partitions.db and Rows.db This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller. This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details. The changes are for the sake of new and unreleased code. No backporting should be done. 
Closes scylladb/scylladb#26075 * github.com:scylladb/scylladb: sstables/mx/reader: remove mx::make_reader_with_index_reader test/boost/bti_index_test: fix indentation sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file sstables/trie: support reader_permit and trace_state properly sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header sstables/trie/bti_index_reader: support BYPASS CACHE test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate sstables/trie: change the signature of bti_partition_index_writer::finish sstables/bti_index: improve signatures of special member functions in index writers streaming/stream_transfer_task: coroutinize `estimate_partitions()` types/comparable_bytes: add a missing implementation for date_type_impl sstables: remove an outdated FIXME storage_service: delete `get_splits()` sstables/trie: fix some comment typos in bti_index_reader.cc sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone	2025-09-18 12:10:27 +03:00
Botond Dénes	edaf67edcb	tools/scylla-sstable: remove writetime-histogram command This command was written for an investigation and was used exactly once. This would have been a perfect candidate for the (also rarely used) scylla-sstable script command, but it didn't exist yet. Drop this command from the tool, such super-specific commands should be written as sstable-scripts nowadays, which is what we will do if we ever need this again. Closes scylladb/scylladb#26062	2025-09-18 12:05:54 +03:00
Nadav Har'El	0b30688641	test/alternator: fix test "test_17119a" to pass on DynamoDB As noticed in issue #26079 the Alternator test test_gsi.py::test_17119a fails on DynamoDB. The problem was that the test added to KeyConditions reading from a GSI an unnecessary attribute - one which was accidentally allowed by Alternator (Refs #26103) but not allowed by DynamoDB. This is easy to fix - just remove the unnecessary attribute from KeyConditions, and the test still works properly and passes on both DynamoDB and Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 11:38:35 +03:00
Piotr Dulikowski	4ed045a15c	Merge 'db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`' from Michał Jadwiszczak When a staging sstable is registered to view building worker, it needs to make a round trip from its original shard to shard 0 (in order to create a view building task) and back (to be eventually processed). Until now this was done using plain `sstables::shared_sstable` (= `lw_shared_ptr`) which is not safe to be moved between shards. This patch fixes this by wrapping the pointer in `foreign_ptr` and obtains the necessary information (owner shard, last token) on the original shard (instead of on shard0). Then all of those objects are put into freshly introduced structure `staging_sstable_task_info`, which can be safely moved between shards. Fixes https://github.com/scylladb/scylladb/issues/25859 View building coordinator isn't present in any release yet, no backport needed. Closes scylladb/scylladb#25832 * github.com:scylladb/scylladb: db/view/view_building_worker: fix indent db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr` db/view/view_building_worker: use table id in `register_staging_sstable_tasks()` db/view/view_building_worker: move helper functions higher	2025-09-18 10:24:27 +02:00
Piotr Dulikowski	b71af71ab5	Merge 'db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source`' from Michał Jadwiszczak Previously the sharded abort_sources were stopped at the end of `batch::do_work()`, which works in parallel to the view building worker main loop. This leads to races because the worker may call `batch::abort()`, which accesses the abort_sources. This patch solves this by changing `sharded<abort_source>` into `abort_source`. Since now `batch::do_work()` is executed on tasks' shard, all abort source checks are also done on tasks' shard. The only place where shard0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on tasks' shard exclusively. Fixes https://github.com/scylladb/scylladb/issues/25805 Fixes https://github.com/scylladb/scylladb/issues/26045 View building coordinator hasn't been released yet, so no backport needed. Closes scylladb/scylladb#26059 * github.com:scylladb/scylladb: db/view/view_building_worker: fix indents db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source` db/view/view_building_worker: execute entire `batch::do_work` on tasks shard db/view/view_building_worker: store reference to sharded worker in batch	2025-09-18 10:11:20 +02:00
Michael Litvak	aae91330b0	nodetool: ignore repair request error of colocated tables when cluster repair is run for an entire keyspace, nodetool makes a repair api request for each table. if the keyspace contains colocated tables, then the api request for the colocated tables will fail, because currently scylla doesn't allow making repair requests for specific colocated tables, but only for base tables. if the request is to repair an entire keyspace then we can ignore this, because we will make a repair request for all base tables, and this in turn will repair also all the colocated tables in the keyspace. however if specific tables are requested and some of them are colocated then we should propagate the error to let the user know the request is invalid. Refs scylladb/scylladb#24816	2025-09-18 09:35:53 +02:00
Michael Litvak	eeaa64ca0e	storage_service: improve error message on repair of colocated tables currently repair requests can't be added or deleted on non-base colocated tables. improve the error message and comments to be more clear and detailed.	2025-09-18 09:35:53 +02:00
Andrzej Jackowski	757dca3bc8	docs: workload-prioritization: add driver service level Refs: scylladb/scylladb#24411	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	452313f5a5	test: add test to verify use of `sl:driver` `sl:driver` is expected to be used for new and control connections, but other connections that run user load should not use it after the user is authenticated. Refs: scylladb/scylladb#24411	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	c02535635e	transport: use `sl:driver` to handle driver's control connections Before `sl:driver` was introduced, service levels were assigned as follows: 1. New connections were processed in `main`. 2. After user authentication was completed, the connection's SL was changed to the user's SL (or `sl:default` if the user had no SL). This commit introduces `service_level_state` to `client_state` and implements the following logic in `transport/server`: 1. If `sl:driver` is not present in the system (for example, it was removed), service levels behave as described above. 2. If `sl:driver` is present, the flow is: I. New connections use `sl:driver`. II. After user authentication is completed, the connection's SL is changed to the user's SL (or `sl:default`). III. If a REGISTER (to events) request is handled, the client is processing the control connection. We mark the client_state to permanently use `sl:driver`. The aforementioned state `2.III` is represented by `_control_connection` flag in `client_state`. Fixes: scylladb/scylladb#24411	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	49aa7613ae	transport: whitespace only change in update_scheduling_group The indentation is changed because it will be required in the next commit of this patch series.	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	43472e8633	transport: call update_scheduling_group for non-auth connections Before this change, unauthenticated connections stayed in the `main` scheduling group. This is not ideal: in such a case `sl:default` should be used instead, to be consistent with the scenario where a user is authenticated but there is no service level assigned to the user. This commit adds a call to `update_scheduling_group` at the end of connection creation for an unauthenticated user, to make sure the service level is switched to `sl:default`. Fixes: scylladb/scylladb#26040	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	1ad483749a	generic_server: transport: start using `sl:driver` for new connections Before this change, new connections were handled in a default scheduling group (`main`), because before the user is authenticated we do not know which service level should be used. With the new `sl:driver` service level, creation of new connections can be moved to `sl:driver`. We switch the service level as early as possible, in `do_accepts`. There is a possibility, that `sl:driver` will not exist yet, for instance, in specific upgrade cases, or if it was removed. Therefore, we also switch to `sl:driver` after a connection is accepted. Refs: scylladb/scylladb#24411	2025-09-18 09:29:29 +02:00
Andrzej Jackowski	e1b4a338ba	test: add test_desc_* for driver service level Driver service level is a special service level that is created automatically by the system. Therefore, it requires special handling in DESC SCHEMA WITH INTERNALS and these tests verify the special behavior. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	43a0eb7b0b	test: service_levels: add tests for sl:driver creation and removal Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	4af270a271	test: add reload_raft_topology_state() to ScyllaRESTAPIClient To encapsulate `/storage_service/raft_topology/reload` API call	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	6f678a2d1f	service_level_controller: automatically create `sl:driver` This commit: - Increases the number of allowed scheduling groups to allow the creation of `sl:driver`. - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating `sl:driver` until all nodes have increased the number of scheduling groups. - Starts using `get_create_driver_service_level_mutations` to unconditionally create `sl:driver` on `raft_initialize_discovery_leader`. The purpose of this code path is ensuring existence of `sl:driver` in new system and tests. - Starts using `migrate_to_driver_service_level` to create `sl:driver` if it is not already present. The creation of `sl:driver` is managed by `topology_coordinator`, similar to other system keyspace updates, such as the `view_builder` migration. The purpose of this code path is handling upgrades. - Modifies related tests to pass after `sl:driver` is added. Later in this patch series, `sl:driver` will be used by `transport/server` to handle selected traffic, such as the driver's schema and topology fetches. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	6a911bff3f	service_level_controller: methods to create driver service level This commit implements `get_create_driver_service_level_mutations` and `migrate_to_driver_service_level` in service_level_controller. Both methods create `sl:driver` with shares=200 and store this fact in `system.scylla_local`. Both methods will be used later in this patch series for automatic creation of sl:driver. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	5cb4577800	service_level_controller: handle special sl:driver in DESC output Later in this patch series, `sl:driver` will be added as a special service level created automatically by the system. It needs special handling in `DESC SCHEMA ...` to ensure that during backup restore: 1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists 2. If `sl:driver` exists, its configuration is fully restored (emit ALTER SERVICE LEVEL). 3. If `sl:driver` was removed, the information is retained (emit DROP SERVICE LEVEL instead of CREATE/ALTER). Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	09c8f67e69	topology_coordinator: add service_level_controller reference This adds a reference to sl_controller so that, later in this patch series, topology_coordinator can manage creating `sl:driver` once group0 is fully operational. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	dd9b4c64d2	system_keyspace: add service_level_driver_created This commit extends the system.scylla_local table with an additional key/value pair that can be used later in this patch series to record that `sl:driver` was already created. The purpose of storing this information is to ensure that `sl:driver` is not recreated after being intentionally removed. A new mutation is included in `register_raft_pull_snapshot` to keep `service_level_driver_created` in the state machine snapshot, which is required for proper propagation of the value when a new node is added to the cluster. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	d30590c1d0	test: add MAX_USER_SERVICE_LEVELS Previously, tests used the hardcoded value 7 for the maximum number of user service levels. This commit introduces a named variable that can be shared across tests to avoid cases where this magic number goes out of sync.	2025-09-18 09:28:32 +02:00
Nadav Har'El	22f88bff30	test/alternator: fix test to pass on DynamoDB As noticed in issue #26079, the Alternator test test_number.py::test_invalid_numbers failed on DynamoDB, because one of the things it did, as a "sanity check", was to check that the number 0e1000 was a valid number. But it turns out it isn't allowed by DynamoDB. So this patch removes 0e1000 from the list of valid numbers in test_invalid_numbers, and instead creates a whole new test for the case of 0e1000. It turns out that DynamoDB has a bug (it appears to be a regression, because test_invalid_numbers used to pass on DynamoDB!) where it allows 0.0e1000 (since it's just zero, really!) but forbids 0e1000 which is incorrectly considered to have a too-large magnitude. So we introduce a test that confirms that Alternator correctly allows both 0.0e1000 and 0e1000. DynamoDB fails this test (it allows the first, forbidding the second), making it the first Alternator test tagged as a "dynamodb_bug". Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 10:28:01 +03:00
Pawel Pery	12f04edf22	vector_store_client: rename embedding into vs_vector According to the changes in Vector Store API (VECTOR-148) the `embedding` term should be changed to `vector`. As `vector` term is used for STL class the internal type or variable names would be changed to `vs_vector` (aka vector store vector). This patch changes also the HTTP ann json request payload according to the Vector Store API changes. Fixes: VECTOR-229 Closes scylladb/scylladb#26050	2025-09-18 08:45:46 +03:00
Ferenc Szili	de5dab8429	docs: add capacity based balancing explanation Capacity based balancing was introduced in 2025.1. It computes balance based on a node's capacity: the number of tablets located on a node should be directly proportional to that node's storage capacity. This change adds this explanation to the docs. Fixes: #25686 Closes scylladb/scylladb#25687	2025-09-18 08:14:04 +03:00
Yauheni Khatsianevich	adc4a4e15a	test: new test for LWT testing during tablets migration E2E test runs multi-column CAS workload (LOCAL_QUORUM/LOCAL_SERIAL) while tablets are repeatedly migrated between nodes. Uncertainty timeouts are resolved via LOCAL_SERIAL reads; guards use max(row, lower_bound). Final assertion: s{i} per (pk,i) equals the count of confirmed CAS by worker i (no lost/phantom updates) despite tablet moves. Closes scylladb/scylladb#25402	2025-09-18 08:11:12 +03:00
Ernest Zaslavsky	ddf2588985	treewide: Move replica related files to `replica` directory As requested in #22099, moved the files and fixed other includes and build system. Moved files: - cache_temperature.hh - cell_locking.hh Fixes: #22099 Closes scylladb/scylladb#25079	2025-09-18 08:00:35 +03:00
Pavel Emelyanov	65638232e8	Merge 'utils: azure: Catch system errors when probing IMDS and bump the verbosity of logs' from Nikos Dragazis This PR fixes a bug in the Azure default credential provider that would cause the `test_azure_provider_with_incomplete_creds` unit test to be flaky. The provider would assume that an unreachable IMDS endpoint would always result in a timeout, but network errors are also possible (e.g., ICMP "host unreachable"). The issue is triggered by this particular test because it sets the IMDS endpoint to a non-routable address. Some routers choose to silently drop such packets, while others return ICMP errors. To fix it, the default credential provider has been updated to catch system errors as well. This PR also raises the log level of the default credential provider from DEBUG to INFO, making it easier for operators to diagnose authentication issues. More details in the commit messages. Fixes #25641. Closes scylladb/scylladb#25696 * github.com:scylladb/scylladb: utils: azure: Catch system errors when detecting IMDS utils: azure: Bump default credential logs from DEBUG to INFO	2025-09-18 07:43:00 +03:00
Nadav Har'El	3c969e2122	cql: document and test permissions on materialized views and CDC We were recently surprised (in pull request #25797) to "discover" that Scylla does not allow granting SELECT permissions on individual materialized views. Instead, all materialized views of a base table are readable if the base table is readable. In this patch we document this fact, and also add a test to verify that it is indeed true. As usual for cqlpy tests, this test can also be run on Cassandra - and it passes showing that Cassandra also implemented it the same way (which isn't surprising, given that we probably copied our initial implementation from them). The test demonstrates that neither Scylla nor Cassandra prints an error when attempting to GRANT permissions on a specific materialized view - but this GRANT is simply ignored. This is not ideal, but it is the existing behavior in both and it's not important now to change it. Additionally, because pull request #25797 made CDC-log permissions behave the same as materialized views - i.e., you need to make the base table readable to allow reading from the CDC log, this patch also documents this fact and adds a test for it also. Fixes #25800 Closes scylladb/scylladb#25827	2025-09-18 07:41:35 +03:00
Botond Dénes	839056b648	docs/operating-scylla: scylla-sstable.rst: update write docs scylla-sstable write (and scrub) moved to UUID generations in `514f59d157`, but said patch forgot to update the docs. This is fixed here. Closes scylladb/scylladb#25965	2025-09-18 07:39:50 +03:00
Ernest Zaslavsky	c9c245c756	rest_client: set `version` on http::request to avoid invalid state Upcoming changes in Seastar cause `rest::simple_send` to move the `http::request` into `seastar::http::experimental::client::make_request` when called multiple times. This leaves the original request in an invalid state. Specifically, the `_version` field becomes empty, causing request validation to fail. This patch ensures `version` is explicitly set to prevent such failures. Fixes: https://github.com/scylladb/scylladb/issues/26018 Closes scylladb/scylladb#26066	2025-09-18 07:36:25 +03:00
Michał Jadwiszczak	1d8b41a51d	db/view/view_building_worker: fix indents	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	99db5a6c30	db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source` Previously the sharded abort_sources were stopped at the end of batch::do_work(), which works in parallel to the view building worker main loop. This leads to races because the worker may call batch::abort(), which accesses the abort_sources. This patch solves this by changing `sharded<abort_source>` into `abort_source`. Since now `batch::do_work()` is executed on tasks' shard, all abort source checks are also done on tasks' shard. The only place where shard0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on tasks' shard exclusively. Fixes scylladb/scylladb#25805 Fixes scylladb/scylladb#26045	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	7b9db335c0	db/view/view_building_worker: execute entire `batch::do_work` on tasks shard This change will allow us to get rid of problematic `sharded<abort_source>` and use local `abort_source` instead.	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	2f65af8aa7	db/view/view_building_worker: store reference to sharded worker in batch Change reference to view building worker in batch to sharded container. In next commits, I'm going to execute `do_work()` exclusively on tasks target shard and sharded reference will be more useful.	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	e4a0de53ea	db/view/view_building_worker: fix indent	2025-09-18 02:57:36 +02:00
Michał Jadwiszczak	7dfb76f9a7	db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr` When a staging sstable is registered to view building worker, it needs to make a round trip from its original shard to shard 0 (in order to create a view building task) and back (to be eventually processed). Until now this was done using plain `sstables::shared_sstable` (= `lw_shared_ptr`) which is not safe to be moved between shards. This patch fixes this by wrapping the pointer in `foreign_ptr` and obtaining the necessary information (owner shard, last token) on the original shard (instead of on shard0). Then all of those objects are put into a freshly introduced structure `staging_sstable_task_info`, which can be safely moved between shards. Fixes scylladb/scylladb#25859	2025-09-18 02:57:36 +02:00
Michał Jadwiszczak	50678030c0	db/view/view_building_worker: use table id in `register_staging_sstable_tasks()` There is no need to pass the pointer only to get id of the table.	2025-09-18 02:57:35 +02:00
Michał Jadwiszczak	b44c223d47	db/view/view_building_worker: move helper functions higher So they can be used in `view_building_worker::register_staging_sstable_tasks()`.	2025-09-18 02:57:35 +02:00
Wojciech Mitros	f17beba834	load_balancer: include dead nodes when calculating rack load Load balancer aims to preserve a balance in rack loads when generating tablet migrations. However, this balance might get broken when dead nodes are present. Currently, these nodes aren't included in rack load calculations, even if they own tablet replicas. As a result, load balancer treats racks with dead nodes as racks with a lower load, so it generates migrations to these racks. This is incorrect, because a dead node might come back alive, which would result in having multiple tablet replicas on the same rack. It's also inefficient even if we know that the node won't come back - when it's being replaced or removed. In that case we know we are going to rebuild the lost tablet replicas so migrating tablets to this rack just doubles the work. Allowing such migrations to happen would also require adjustments in the materialized view pairing code because we'd temporarily allow having multiple tablet replicas on the same rack. So in this patch we include dead nodes when calculating rack loads in the load balancer. The dead nodes still aren't treated as potential migration sources or destinations. We also add a test which verifies that no migrations are performed by doing a node replace with a mv workload in parallel. Before the patch, we'd get pairing errors and after the patch, no pairing errors are detected. Fixes https://github.com/scylladb/scylladb/issues/24485 Closes scylladb/scylladb#26028	2025-09-17 20:49:18 +02:00
Avi Kivity	3acfc577d8	Merge 'tools/scylla-sstable: extract json mutation stream parser into own hh,cc' from Botond Dénes tools/scylla-sstable.cc has 3.5k SLOC, out of which this class alone is 1K. Extract into own hh and cc. Since this class was already using pimpl, the header remains nice and small. Code cleanup, no backport needed. Closes scylladb/scylladb#26064 * github.com:scylladb/scylladb: tools: extract json_mtuation_stream_parser to its own hh,cc files tools/scylla-sstable: fix indentation tools/scylla-sstable: prepare for extracting json_mutation_stream_parser	2025-09-17 18:30:30 +03:00
Ernest Zaslavsky	54aa552af7	treewide: Move type related files to a `type` directory As requested in #22110 , moved the files and fixed other includes and build system. Moved files: - duration.hh - duration.cc - concrete_types.hh Fixes: #22110 This is a cleanup, no need to backport Closes scylladb/scylladb#25088	2025-09-17 17:32:19 +03:00
Ernest Zaslavsky	a1f18a8883	treewide: Move schema related files to a `schema` directory As requested in #22111 , moved the files and fixed other includes and build system. Moved files: - frozen_schema.hh - frozen_schema.cc - schema_mutations.hh - schema_mutations.cc - column_computation.hh Fixes: #22111 Closes scylladb/scylladb#25089	2025-09-17 17:31:05 +03:00
Botond Dénes	bde7d8ddbd	Merge 'service: pass current session_id to repair rpc' from Aleksandra Martyniuk Currently, in repair_tablet we retrieve session_id from tablet map (and throw if it isn't specified). In case of topology coordinator failover, we may end up in a situation where a node runs outdated repair, treating session of a different operation as the repair's session: - topology coordinator starts repair transition (A); - topology coordinator sends tablet repair rpc to node1; - topology coordinator is separated from the cluster; - new topology coordinator is elected; - new topology coordinator sees waiting repair request (A_2) and executes it; - new repair of the same tablet is requested (B); - new topology coordinator starts repair transition (B); - new topology coordinator sends tablet repair rpc to node2; - node2 starts repair (B) as repair master; - node1 starts repair (A), checks the current session (B), proceeds with repair (B) as repair master. Send current session_id in repair_tablet rpc. If this session_id and session id got from tablet map don't match, an exception is thrown. Fixes: https://github.com/scylladb/scylladb/issues/23318. No backport; changes in rpc signatures Closes scylladb/scylladb#25369 * github.com:scylladb/scylladb: test: check that repair with outdated session_id fails service: pass current session_id to repair rpc	2025-09-17 17:28:35 +03:00
Botond Dénes	cc5153ef8c	Merge 'db: cache: consider preempting after each partition' from Aleksandra Martyniuk Currently, during cache invalidation we check if we need to preempt only after the partition gets invalidated. This may lead to stalls if we have a chain of filtered out partitions. Check for preemption even if the partition does not get invalidated. Refs: https://github.com/scylladb/scylladb/issues/9136. Optimization; no backport Closes scylladb/scylladb#26053 * github.com:scylladb/scylladb: db: fix indentation db: cache: consider preempting after each partition	2025-09-17 17:26:29 +03:00
Botond Dénes	a8d22a66fa	Merge 'Improve Encryption at Rest documentation' from Nikos Dragazis This PR introduces a major rewrite of the EaR document. The initial motivation for this PR was to fully cover all our supported key providers with working examples, and to add instructions for key rotation. However, many other improvements were made along the way. Main changes in this PR: * Add a high-level description for every key provider. Mention limitations. * Better organize existing provider-specific instructions by placing them into clearly separated, tabbed sections. * Add instructions for the replicated key provider. Mention explicitly that it cannot be used as default option for user or system encryption, and that it does not support key rotation. * Add more examples for KMS and GCP to cover all credential types. * Document missing configuration options. * Add a new section for key rotation. Notes: * Some of the patches in this series have been cherry-picked from Laszlo's wip branch. * This PR is expected to conflict with the Azure Key Vault PR, which should be merged first. (https://github.com/scylladb/scylladb/pull/23920/) * Support for KMIP system keys in the Replicated Key Provider is currently broken. (https://github.com/scylladb/scylladb/issues/24443) Fixes scylladb/scylla-enterprise#3535. Refs scylladb/scylla-enterprise#3183. Only doc changes. No backport is needed. Closes scylladb/scylladb#24558 * github.com:scylladb/scylladb: encryption-at-rest.rst: add "Rotate Encryption Keys" section encryption-at-rest.rst: rewrite "Encrypt System Resources" section encryption-at-rest.rst: rewrite "Update Encryption Properties of Existing Tables" section encryption-at-rest.rst: rewrite "Encrypt a Single Table" section encryption-at-rest.rst: rewrite "Encrypt Tables" section encryption-at-rest.rst: update "Set the Azure Host" section encryption-at-rest.rst: update "Set the GCP Host" section encryption-at-rest.rst: update "Set the KMS Host" section encryption-at-rest.rst: update "Set the KMIP Host" section encryption-at-rest.rst: rewrite "Create Encryption Keys" section encryption-at-rest.rst: rewrite "Key Providers" section encryption-at-rest.rst: hoist and update "Cipher Algorithm Descriptors" encryption-at-rest.rst: rewrite/replace section "Encryption Key Types" encryption-at-rest.rst: About: describe high-level operation more precisely encryption-at-rest.rst: improve wording / formatting in About intro encryption-at-rest.rst: users (plural) typo fix encryption-at-rest.rst: rewrap encryption-at-rest.rst: strip trailing whitespace	2025-09-17 17:25:25 +03:00
Nadav Har'El	d63fdd1e8b	test/cqlpy: fix run-cassandra to run with Java 21 The script test/cqlpy/run-cassandra aims to make it easy to run any version of Cassandra using whatever version of Java the user has installed. Sadly, the fact that Java keeps changing and the Cassandra developers are very slow to adapt to new Javas makes doing this non-trivial. This patch makes it possible for run-cassandra to run Cassandra 5 on the Java 21 that is now the default on Fedora 42. Fedora 42 no longer carries antique versions of Java (like Java 8 or 11), not even as an optional package. Sadly, even with this patch it is not possible to run older versions of Cassandra (4 and 3) with Java 21, because the new Java is missing features such as Netty that the older Cassandra versions require. But at least it restores the ability to run our cqlpy tests against Cassandra 5. Also, this patch adds to test/cqlpy/README.md simple instructions on how to install Java 11 (in addition to the system's default Java 21) on Fedora 42. Doing this is very easy and highly recommended because it restores the ability to run Cassandra 3 and 4, not just Cassandra 5. Fixes #25822. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25825	2025-09-17 17:24:47 +03:00
Botond Dénes	85f6eeda30	Merge 'compaction/scrub: register sstables for compaction before validation' from Lakshmi Narayanan Sreethar compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 This reported scrub failure occurs on all versions that have the checksum/digest validation feature for uncompressed sstables. So, backport it to older versions. Closes scylladb/scylladb#26034 * github.com:scylladb/scylladb: compaction/scrub: register sstables for compaction before validation compaction/scrub: handle exceptions when moving invalid sstables to quarantine	2025-09-17 17:22:00 +03:00
Piotr Smaron	bdb90ee15c	set ssl_* columns in system.clients Depends on https://github.com/scylladb/seastar/pull/2651 Missing columns have been present since probably forever - they were added to the schema but never assigned any value: ``` cqlsh> select * from system.clients; ------------------+------------------------ ... ssl_cipher_suite \| null ssl_enabled \| null ssl_protocol \| null ... ``` This patch sets values of these columns: - with a TLS connection, the 3 TLS-related fields are filled in, - without TLS, `ssl_enabled` is set to `false` and other columns are `null`, - if there's an error while inspecting TLS values, the connection is dropped. We want to save the TLS info of a connection just after accepting it, but without waiting for a TLS handshake to complete, so once the connection is accepted, we're inspecting it in the background for the server to be able to accept next connections immediately. Later, when we construct system.clients virtual table, the previously saved data can be instantaneously assigned to client_data, which is a struct representing a row in system.clients table. This way we don't slow down constructing this table by more than necessary, which is relevant for cases with plenty of connections. Fixes: #9216 Closes scylladb/scylladb#22961	2025-09-17 16:29:55 +03:00
Nadav Har'El	3c0032deb4	alternator: fix bug in combination of AttributeUpdates + ReturnValues In test/alternator/test_returnvalues.py we had tests for the ReturnValues feature on UpdateItem requests - but we only tested UpdateItem requests with the "modern" UpdateExpression, and forgot to test the combination of ReturnValues with the old AttributeUpdates API. It turns out this combination is buggy: when both ReturnValues=ALL_OLD and AttributeUpdates need the previous value of the item, we may wrongly std::move() the value out, and the operation will fail with a strange error: An error occurred (ValidationException) when calling the UpdateItem operation: JSON assert failed on condition 'IsObject()' The fix in this patch is trivial - just move the std::move() to the correct place, after both UpdateExpression and AttributeUpdates handling is done. This patch also includes a reproducing test, which fails before this patch and passes with it - and of course passes on DynamoDB. This test reproduces two cases where the bug happened, as well as one case where it didn't (to make sure we don't regress in what already worked). Fixes #25894 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25900	2025-09-17 16:04:01 +03:00
Michael Litvak	ef0b34ec9d	docs/dev: update CDC dev docs for tablets	2025-09-17 14:47:13 +02:00
Michael Litvak	acd0eebd54	doc: update CDC docs for tablets Now that CDC is enabled for tablets-based keyspaces, update the docs and add explanations about the differences.	2025-09-17 14:47:13 +02:00
Michael Litvak	221a687ec9	test: cluster_events: enable add_cdc and drop_cdc add_cdc and drop_cdc were skipped because CDC wasn't supported with tablets. now that CDC is supported with tablets we should unskip them.	2025-09-17 14:47:13 +02:00
Michael Litvak	73d05c7214	test/cql: enable cql cdc tests to run with tablets	2025-09-17 14:47:13 +02:00
Michael Litvak	001c55c213	test: test_cdc_with_alter: adjust for cdc with tablets previously the test set tablets to disabled because cdc wasn't supported with tablets. now we can change this to use the default to enable it to run with either tablets or vnodes.	2025-09-17 14:47:13 +02:00
Michael Litvak	778dec2630	test/cqlpy: adjust cdc tests for tablets update cdc-related tests in test/cqlpy for cdc with tablets. * test_cdc_log_entries_use_cdc_streams: this test depends on the implementation of the cdc tables, which is different for tablets, so it's changed to run for both vnodes and tablets keyspaces, and we add the implementation for tablets. * some cdc-related tests are unskipped for tablets so they will be run with both tablets and vnodes keyspaces. these are tests where the implementation may be different between tablets and vnodes and we want to have coverage of both. * other cdc-related tests do not depend on the implementation differences between tablets and vnodes, so we can just enable them to run with the default configuration. previously they were disabled for tablets keyspaces because it wasn't supported, so now we remove this.	2025-09-17 14:47:13 +02:00
Michael Litvak	5a87d0f6c9	test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests Introduce basic tests creating CDC tables in tablets-enabled keyspaces, verifying we can create and drop CDC tables, write and consume CDC log entries, and consume the log while splitting streams.	2025-09-17 14:47:13 +02:00
Michael Litvak	1fc3273b27	cdc: enable cdc with tablets Allow to create CDC tables in a tablets-enabled keyspace when all nodes in the cluster support the cdc_with_tablets feature. Fixes scylladb/scylladb#22576	2025-09-17 14:47:12 +02:00
Michael Litvak	98de3f0e86	topology coordinator: change streams on tablet split/merge on tablet split/merge finalization, generate a new CDC timestamp and stream set for the table with a new stream for each tablet in the new tablet map, in order to maintain synchronization of the CDC streams with the tablets. We pick a new timestamp for the streams with a small delay into the future so that all nodes can learn about the new streams in time, in the same way it's done for vnodes. the new timestamp and streams are published by adding a mutation to the cdc_streams_history table that contains the timestamp and the sets of closed and opened streams from the current timestamp.	2025-09-17 14:47:12 +02:00
Michael Litvak	b8442c6087	cdc: virtual tables for cdc with tablets Define two new virtual tables in system keyspace: cdc_timestamps and cdc_streams. They expose the internal cdc metadata for tablets-enabled keyspaces to be consumed by users consuming the CDC log. cdc_timestamps lists all timestamps for a table where a stream change occurred. cdc_streams additionally lists the current stream sets for each table and timestamp, as well as the difference - closed and opened streams - from the previous stream set.	2025-09-17 14:47:12 +02:00
Michael Litvak	67410cac4d	cdc: generate_stream_diff helper function This helper function receives two sets of streams and constructs their difference - closed and opened streams.	2025-09-17 14:47:12 +02:00
Michael Litvak	5f6bb0af9d	cdc: choose stream in tablets enabled keyspaces When choosing a CDC stream to generate CDC log writes to, if the keyspace uses tablets, we need to choose a stream according to the relevant metadata which is specific to tablets-enabled keyspaces. We define the method get_tablet_stream that given a table, write timestamp, and token, returns the stream that the log entry should be written to. The method works by looking up the stream metadata of the table, then finding the relevant stream set by timestamp, and finally finding the stream that covers the token range that contains the token.	2025-09-17 14:47:12 +02:00
Michael Litvak	28cdd81ef0	cdc: rename get_stream to get_vnode_stream the get_stream method is relevant only for vnode-based keyspaces. next we will introduce a new method to get a stream in a tablets-based keyspace. prepare for this by renaming get_stream to get_vnode_stream.	2025-09-17 14:47:12 +02:00
Michael Litvak	9ec4b6ccb1	cdc: load tablet streams metadata from tables Read the CDC stream metadata from the internal system tables, and store it in the cdc metadata data structures. The metadata is stored in the tables as diffs which is more storage efficient, but when in-memory we store it as full stream sets for each timestamp. This is more useful because we need to be able to find a stream given timestamp and token.	2025-09-17 14:47:12 +02:00
Michael Litvak	aa61f074b5	cdc: helper functions for reading metadata from tables Define functions in system_keyspace that read from the internal cdc tables and construct the data into internal cdc data structures.	2025-09-17 14:47:12 +02:00
Michael Litvak	650ae30c97	cdc: colocate cdc table with base When creating a tablet map for a CDC table, make it be co-located with its base table. We modify db::get_base_table_for_tablet_colocation to return the base table id of a CDC table, handling both cases that the base table is a new table that's created in the same operation, or is an existing table in the db. This function is used by the tablet allocator to decide whether to create a co-located tablet map or allocate new tablets.	2025-09-17 14:47:12 +02:00
Michael Litvak	9ef0862155	cdc: remove streams when dropping CDC table When dropping a CDC log table in a tablets-enabled keyspace, remove all metadata about the table's CDC streams from the internal CDC tables, since the streams can't be read anymore. Similarly, when dropping a tablets-enabled keyspace, remove metadata of all streams belonging to tables in the keyspace.	2025-09-17 14:47:12 +02:00
Michael Litvak	ed25e420f8	cdc: create streams when allocating tablets When allocating tablets for a CDC table, create the initial CDC stream set. We create one stream per each tablet, each stream covering the corresponding token range.	2025-09-17 14:47:12 +02:00
Michael Litvak	7f2cd06bdc	migration_listener: add on_before_allocate_tablet_map notification Add a new notification on_before_allocate_tablet_map that is called when creating a tablet map for a new table and passes the tablet map. This will be useful next for CDC for example. when creating tablets for a new table we want to create CDC streams for each tablet in the same operation, and we need to have the tablet map with the tablet count and tokens for each tablet, because the CDC streams are based on that. We need to change slightly the tablet allocation code for this to work with colocated tables, because previously when we created the tablet map of a colocated table we didn't have a reference to the base tablet map, but now we do need it so we can pass it to the notification.	2025-09-17 14:47:11 +02:00
Michael Litvak	fdfe9ebb4c	cdc: notify when creating or dropping cdc table When creating a CDC table by updating an existing base table and enabling CDC, notify about the table creation so subscribers can act on it. This is needed in particular for notifying the tablet allocator when creating a CDC table so that it will allocate tablets for the CDC table. Also, when dropping a CDC table, notify about the dropped table. This is needed for the tablet allocator to remove the tablet map of the CDC table.	2025-09-17 14:47:11 +02:00
Michael Litvak	fed1048059	cdc: move cdc table creation to pre_create When creating a new table with CDC enabled, we create also a CDC log table by adding the CDC table's mutations in the same operation. Previously, it worked by the CDC log service subscribing to on_before_create_column_family and adding the CDC table's mutations there when being notified about a new created table. The problem is that when we create the tables we also create their tablet maps in the tablet allocator, and we want to create the two tables as co-located tables: we allocate a tablet map for the base table, and the CDC table is co-located with the base table. This doesn't work well with the previous approach because the notification that creates the CDC table is the same notification that the tablet allocator creates the base tablet map, so the two operations are independent, but really we want the tablet allocator to work on both tables together, so that we have the base table's schema and tablet map when we create the CDC table's co-located tablet map. In order to achieve this, we want to create and add the CDC table's schema, and only after that notify using before_create_column_families with a vector that contains both the base table and CDC table. The tablet allocator will then have all the information it needs to create the co-located tablet map. We move the creation of the CDC log table - instead of adding the table's mutations in on_before_create_column_family, we create the table schema and add it to the new tables vector in on_pre_create_column_families, which is called by the migration manager in do_prepare_new_column_families_announcement. The migration manager will then create and add all mutations for creating the tables, and notify about the tables being created together.	2025-09-17 14:47:11 +02:00
Michael Litvak	b9ee28eaab	cdc: add internal tables for cdc with tablets Add new group0-based tables in system keyspace to be used for cdc with tablets: * cdc_streams_state - describing "base" state of CDC streams for each table - an initial timestamp and a stream set. * cdc_streams_history - describing following committed stream sets by diffs (opened / closed streams) from the previous set.	2025-09-17 14:47:11 +02:00
Michael Litvak	5f1caebcc7	cdc: add cdc_with_tablets feature flag add a new feature flag cdc_with_tablets to protect the schema changes that are required for the CDC with tablets feature. we will also use it to allow start using CDC in tablets-based keyspaces only once all nodes are upgraded and support this feature.	2025-09-17 14:47:11 +02:00
Michael Litvak	daf200facb	cdc: add is_log_schema helper In a few places we need to check whether a schema represents a CDC log table, and we do so by checking whether the table's partitioner is the CDC partitioner. Extract this logic to a new utility function to reduce code duplication and allow reuse.	2025-09-17 14:47:11 +02:00
Piotr Dulikowski	6a90a1fd29	Merge 'db/view/view_building_worker: split batch's data preparation and execution' from Michał Jadwiszczak The view building batch lives on shard0 but it might be doing work on shard which owns the tablet replica. Until now the batch data was accessed from multiple shards (shard0 and where the batch was executed). This patch fixes this by splitting tasks execution into: - preparation which is always happening on shard0 - actual execution of the tasks on relevant shard, but all necessary data is copied to the shard and batch object isn't accessed. Fixes https://github.com/scylladb/scylladb/issues/25804 View building coordinator hasn't been released yet, so no backport needed. Closes scylladb/scylladb#26058 * github.com:scylladb/scylladb: db/view/view_building_worker: move try-catch outside `invoke_on()` db/view/view_building_worker: split batch's data preparation and execution	2025-09-17 14:17:25 +02:00
Botond Dénes	30a3f61fa0	Merge 'compaction: handle exception in expected_total_workload' from Aleksandra Martyniuk expected_total_workload methods of scrub compaction tasks create a vector of table_info based on table names. If any table was already dropped, then the exception is thrown. It leaves table_info in corrupted state and node crashes with `free(): invalid size`. Return std::nullopt if an exception was thrown to indicate that total workload cannot be found. Fixes: #25941. No release branches affected Closes scylladb/scylladb#25944 * github.com:scylladb/scylladb: tasks: get progress of failed task based on children compaction: handle exception in expected_total_workload	2025-09-17 15:10:19 +03:00
Nadav Har'El	e322902506	Merge 'index, metrics: add per-index metrics' from Michał Hudobski This patch adds the possibility to track metrics per secondary index. Currently, only a histogram of query latencies is tracked, but more metrics can be added in the future. To add a new metric, it needs to be added to the index_metrics struct in index/secondary_index_manager.hh and then initialized in index/secondary_index_manager.cc in the constructor of the index_metrics struct. The metrics are created when the index is created and removed when the index is dropped. First lines of the new metric: \# HELP scylla_index_query_latencies Index query latencies \# TYPE scylla_index_query_latencies histogram scylla_index_query_latencies_sum{idx="test_i_idx",ks="test"} 640 scylla_index_query_latencies_count{idx="test_i_idx",ks="test"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="640.000000"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="768.000000"} 1 Fixes: https://github.com/scylladb/scylladb/issues/25970 Closes scylladb/scylladb#25995 * github.com:scylladb/scylladb: test: verify that the index metric is added index, metrics: add per-index metrics	2025-09-17 14:54:12 +03:00
Michał Chojnowski	b7afda5030	sstables/mx/reader: remove mx::make_reader_with_index_reader When `mx::make_reader` is used to construct an sstable reader, it constructs its own index reader internally. `mx::make_reader_with_index_reader` was originally added as a variant of `mx::make_reader` which can be used to inject a custom `index_reader` for testing that the mx Data reader tolerates inexact indexes. But now we want the ability to choose between BIG index readers and BTI index readers if both are present. And at this point, it seems to me that it makes sense to just construct the index reader in the caller and pass it via argument to `mx::make_reader` instead of putting the index selection inside it. So that's what we do in this patch. And we remove `mx::make_reader_with_index_reader` because it's no longer different from `mx::make_reader`.	2025-09-17 12:22:41 +02:00
Michał Chojnowski	f7d7722baa	test/boost/bti_index_test: fix indentation Fix an indentation mishap. Purely cosmetic patch.	2025-09-17 12:22:41 +02:00
Michał Chojnowski	191405fc51	sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file Before this patch, `bti_index_reader::last_block_offset()` returns the offset of the last block within the file. But the old `index_reader::last_block_offset()` returns the offset within the partition, and that's what the callers (i.e. reversed sstable reader) expect. Fix `bti_index_reader::last_block_offset()` (and the corresponding comment and test) to match `index_reader::last_block_offset()`.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	1f85069389	sstables/trie: support reader_permit and trace_state properly Before this patch, the `reader_permit` taken by `bti_index_reader` wasn't actually being passed down to disk reads. In this patch, we fix this FIXME by propagating the permit down to the I/O operations on the `cached_file`. Also, it didn't take `trace_state_ptr` at all. In this patch, we add a `trace_state_ptr` argument and propagate it down to disk reads. (We combine the two changes because the permit and the trace state are passed together everywhere anyway).	2025-09-17 12:22:40 +02:00
Michał Chojnowski	c9b0dbc580	sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached Small optimization. In some places we call `load` on a position that is in the currently-held page. In those cases we are needlessly calling `cached_file::get_shared_page` for the same page again, adding some work and some noise in CQL tracing. This patch adds an `if` against that.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	caecc02d75	sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header Before this patch, the "stack" of wrappers in `read_row_index_header` is: - row_index_header_parser - continuous_data_consumer - input_stream - file_data_source_impl - cached_file_impl - cached_file::stream - cached_file The `cached_file_impl` and `file_data_source_impl` are superfluous. We don't need to pretend the `cached_file` is a `seastar::file`, in particular we don't need random reads. We can use `cached_file::stream` to provide buffers for `input_stream` directly. Note: we use the `cached_file::stream` without any size hints (readahead). This means that parsing a large partition key -- which spans many pages -- might require multiple serialized disk fetches. This could be improved (e.g. if the first two bytes of the entry, which contain the partition key, are in the cached page, we could make a size hint out of them) but we ignore this for now, under the assumption that multi-page partition keys are a fringe use case.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	98b7655d2b	sstables/trie/bti_index_reader: support BYPASS CACHE Before this patch, `bti_index_reader` doesn't have a good way to implement BYPASS CACHE. In this patch we add a way, similar to what `index_reader` does: we allow the caller to pass in the `cached_file` via a shared pointer. If the caller wants the loads done by the index reader to remain cached, he can pass in the `cached_file` owned by the `sstable`, shared by all caching index readers. If the caller doesn't want the loads to remain cached, he can pass in a fresh `cached_file` which will be privately owned by the index reader, and will be evicted when the index reader dies.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	95c93568f7	test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate Use the helper instead of reading the needed footer field "manually".	2025-09-17 12:22:40 +02:00
Michał Chojnowski	5934639c4b	sstables/trie: change the signature of bti_partition_index_writer::finish Let's return `bti_partitions_db_footer` so that it can be directly saved to `sstables::shareable_components` after the index write is finished, without re-reading the footer from the file. Let's take `const sstables::key&` arguments instead of `disk_string_view<uint16_t>`, that's more natural.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	0b0f6a1cc4	sstables/bti_index: improve signatures of special member functions in index writers Just a bit of constructor boilerplate: noexcept, move constructors, `bool` operators for `optimized_optional`.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	421fb8e722	streaming/stream_transfer_task: coroutinize `estimate_partitions()` In preparation for making `sstable::estimated_keys_for_range` asynchronous.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	bf90018b8e	types/comparable_bytes: add a missing implementation for date_type_impl date_type_impl is like timestamp_type_impl, but unsigned.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	46d8fd5bbd	sstables: remove an outdated FIXME Bloom filters were implemented 10 years ago. We can remove this FIXME now.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	6cb6c1e400	storage_service: delete `get_splits()` Dead code. Thrift API leftovers.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	e8f86315c7	sstables/trie: fix some comment typos in bti_index_reader.cc Spell checkers are complaining.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	416f5d64d4	sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone With `tomb` it's not obvious enough what tombstone this field is about.	2025-09-17 12:22:40 +02:00
Szymon Malewski	776f90e2f8	alternator/expressions.g: Fix antlr3 missing token leak This patch overrides the antlr3 function that allocates the missing tokens that would eventually leak. The override stores these tokens in a vector, ensuring memory is freed whenever the parser is destroyed. Solution is copied from CQL implementation. A unit test to reproduce the issue is added - leak would be reported by ASAN, when running this test in debug mode - the test passed but the leak is discovered when the test file exits. Fixes #25878 Closes scylladb/scylladb#25930	2025-09-17 13:05:24 +03:00
Abhinav Jha	43656371cf	raft_topology: Modify the conditional logic in remove node operation to enhance concurrency for raft enabled clusters. In the current scenario, the shard receiving the remove node REST api request performs conditional lock depending on whether raft is enabled or not. Since non-zero shard returns false for `raft_topology_change_enabled()`, the requests routed to non zero shards are prone to this lock which is unnecessary and hampers the ability to perform concurrent operations, which is possible for raft enabled nodes. This pr modifies the conditional lock logic and orchestrates the remove node execution logic directly to the shard0, hence the `raft_topology_change_enabled()` is now checked on the shard0 and execution is performed accordingly. A test is also added to confirm the new behaviour, where concurrent remove node operations are now being performed seamlessly. This pr doesn't fix a critical bug. No need to backport it. Fixes: scylladb/scylladb#24737	2025-09-17 15:23:32 +05:30
Botond Dénes	2fa0f82910	tools: extract json_mutation_stream_parser to its own hh,cc files tools/scylla-sstable.cc has 3.5k SLOC, out of which this class alone is 1K. Extract into own hh and cc, just a copy-paste after the preparation commit.	2025-09-17 12:18:07 +03:00
Botond Dénes	ffe8918522	tools/scylla-sstable: fix indentation Left broken by previous patch.	2025-09-17 12:16:22 +03:00
Botond Dénes	8c36a983cc	tools/scylla-sstable: prepare for extracting json_mutation_stream_parser Make methods out-of-line, so class declaration stands on its own, without definition of impl. Move auxiliary structures, used only by impl, out of the class scope. Move parser to tools namespace, and auxiliaries to anonymous namespace within the tools one. Pass down logger ref to parser impl and below, to prepare for sst_log not being available in scope. Add comment to parser class explaining what it does.	2025-09-17 12:16:21 +03:00
Benny Halevy	3a6208b319	utils: stall_free: clear_gently: release wrapped objects As discussed in https://github.com/scylladb/scylladb/pull/24606#discussion_r2281870939 clear_gently of shared pointers should release the wrapped object reference and when the object's use_count reaches 1, the object itself would be cleared_gently, before it's destroyed. This behavior is similar to the way we clear gently containers like arrays or vectors, and so it is extended in this patch to smart pointers like unique_ptr and foreign_ptr. The unit tests are adjusted respectively to expect the smart pointers to be reset after clear_gently, plus the use of `reset()` for `foreign_ptr<shared_ptr<>>` was replaced by `clear_gently().get()` which now ensures the reference to a shared object is released, and awaited for, if it happens on a foreign owner shard, unlike reset of a foreign_ptr that kicks off destroy of that shared object in the background on the owner shard - causing flakiness. Fixes #25723 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25759	2025-09-17 11:44:26 +03:00
Patryk Jędrzejczak	454eb08cb4	Merge 'group0: remove obsolete "stop_before_becoming_raft_voter" error injection' from Emil Maskovsky The Raft topology workflow was changed by the limited voters feature: nodes no longer request votership themselves. As a result, the "stop_before_becoming_raft_voter" error injection is now obsolete and has been removed. Fixes: scylladb/scylladb#23418 No backport: This re-enables a test, only needed for master. Closes scylladb/scylladb#26042 * https://github.com/scylladb/scylladb: group0: remove obsolete "stop_before_becoming_raft_voter" error injection test/random_failures: preserve test repeatability when removing error injections	2025-09-17 10:38:32 +02:00
Michał Jadwiszczak	d98237b33c	db/view/view_building_worker: move try-catch outside `invoke_on()` It's just a stylistic change, to me doing `invoke_on()` in try-catch block looks better than the other way.	2025-09-16 23:15:44 +02:00
Michał Jadwiszczak	9458ceff8f	db/view/view_building_worker: split batch's data preparation and execution The view building batch lives on shard0 but it might be doing work on shard which owns the tablet replica. Until now the batch data was accessed from multiple shards (shard0 and where the batch was executed). This patch fixes this by splitting tasks execution into: - preparation which is always happening on shard0 - actual execution of the tasks on relevant shard, but all necessary data is copied to the shard and batch object isn't accessed. Fixes scylladb/scylladb#25804	2025-09-16 23:13:36 +02:00
Patryk Jędrzejczak	368d70ee15	Merge 'LWT: implement fencing' from Petr Gusev This PR consists of three parts: * Small refactoring of the fencing APIs in storage_proxy (renames + comments + some functions were extracted) * Implement the fencing for LWT verbs itself. This includes checking the fencing token before and after local replica data accesses. * Two new `test.py` tests in `test_fencing.py`, which check the fencing in some real-world scenarios. Backport: no need -- fencing for LWT requests is needed primarily for LWT over tablets, which is not released yet. Fixes scylladb/scylladb#22332 Closes scylladb/scylladb#25550 * https://github.com/scylladb/scylladb: test_tablets_lwt: eliminate redundant disable_tablet_balancing test_fencing: add test_lwt_fencing_upgrade pylib: extract upgrade helpers from test_sstable_compression_dictionaries_upgrade.py test_fencing: add test_fenced_out_on_tablet_migration_while_handling_paxos_verb test_fencing: test_fence_lwt_during_bootstap pylib/rest_client.py: encode injection name storage_proxy_stats: add fenced_out_requests metric storage_proxy: add fencing to Paxos verbs storage_proxy::apply_fence: add overload that throws on failure storage_proxy: extract apply_fence_result sp::apply_fence: rename to apply_fence_on_ready sp::apply_fence: rename to check_fence sp::apply_fence: make non-generic	2025-09-16 23:40:48 +03:00
Ernest Zaslavsky	d624413ddd	treewide: Move query related files to a new `query` directory As requested in #22120, moved the files and fixed other includes and build system. Moved files: - query.cc - query-request.hh - query-result.hh - query-result-reader.hh - query-result-set.cc - query-result-set.hh - query-result-writer.hh - query_id.hh - query_result_merger.hh Fixes: #22120 This is a cleanup, no need to backport Closes scylladb/scylladb#25105	2025-09-16 23:40:47 +03:00
Pavel Emelyanov	6fb66b796a	s3: Add metrics to show S3 prefetch bytes The chunked download source sends large GET requests and then consumes data as it arrives. Sometimes it can stop reading from socket early and drop the in-flight data. The existing read-bytes metrics show only the number of consumed bytes, we also want to know the number of requested bytes. Refs #25770 (accounting of read-bytes) Fixes #25876 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25877	2025-09-16 23:40:47 +03:00
Sergey Zolotukhin	2640b288c2	raft: disable caching for raft log. This change disables caching for raft log table due to the following reasons: * Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls. * Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls. Fixes scylladb/scylladb#26027 Closes scylladb/scylladb#26031	2025-09-16 23:40:47 +03:00
Pavel Emelyanov	d69a51f42a	compaction: Use function when filtering compaction tasks for stopping The compaction_manager::stop_compaction() method internally walks the list of tasks and compares each task's compacting_table (which is compaction group view pointer) with the given one. In case this stop_compaction() method is called via API for a specific table, the method walks the list of tasks for every compaction group from the table, thus resulting in nr_groups * nr_tasks complexity. Not terrible, but not nice either. The proposal is to pass filtering function into the inner do_stop_ongoing_compactions() method. Some users will pass a simple "return true" lambda, but those that need to stop compactions for a specific table (e.g. -- the API handler) will effectively walk the list of tasks once comparing the given compaction group's schema with the target table one (spoiler: eventually this place will also be simplified not to mess with replica::table at all). One ugliness with the change is the way "scope" for logging message is collected. If all tasks belong to the same table, then "for table ..." is printed in logs. With the change the scope is no longer known instantly and is evaluated dynamically while walking the list of tasks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25846	2025-09-16 23:40:47 +03:00
Michał Chojnowski	68e6141211	scylla-gdb: add `scylla prepared-statements` Add a helper which prints all prepared statements currently present in the query processor. Example output: ``` (gdb) scylla prepared-statements (cql3::cql_statement*)(0x600003d71050): SELECT * FROM ks.ks WHERE pk = ? (cql3::cql_statement*)(0x600003972b50): SELECT pk FROM ks.ks WHERE pk = ? ``` Closes scylladb/scylladb#26007	2025-09-16 23:40:47 +03:00
Botond Dénes	0cf6a648bb	Merge 'Default create keyspace syntax' from Dario Mirovic Allow for the following CQL syntax: ``` CREATE KEYSPACE [IF NOT EXISTS] <name>; ``` for example: ``` CREATE KEYSPACE test_keyspace; ``` With this syntax all the keyspace's parameters would be defaulted to: replication strategy = `NetworkTopologyStrategy`, replication factor = number of racks , but excluding racks that only have arbiter nodes storage options, durable writes = defaults we normally would use, tablets enabled if they are enabled in the db configuration, e.g. scylla.yaml or db/config.cc by default. Options besides `replication` already have defaults. `replication` had to be specified, but it could be an empty set, where defaults for sub-options (replication strategy and replication factor) would be used - `replication = {}`. Now there is no need for specifying an empty set - omitting `replication = {}` has the same effect as `replication = {}`. Since all the options now have defaults, `WITH` is optional for `CREATE KEYSPACE` statement. Fixes #25145 This is an improvement, no backport needed. Closes scylladb/scylladb#25872 * github.com:scylladb/scylladb: docs: cql: default create keyspace syntax test: cqlpy: add test for create keyspace with no options specified cql: default `CREATE KEYSPACE` syntax	2025-09-16 23:40:47 +03:00
Emil Maskovsky	943af1ef1c	topology_coordinator: consistently rethrow `raft::request_aborted` for direct/global commands Ensure all direct and global topology commands rethrow the `raft::request_aborted` exception when aborted, typically due to leadership changes. This makes abortion explicit to callers, enabling proper handling such as retries or workflow termination. This change completes the work started in PR scylladb/scylladb#23962, covering all remaining cases where the exception was not rethrown. Fixes: scylladb/scylladb#23589 No backport: No related issues observed in previous versions; backport not required. Closes scylladb/scylladb#26021	2025-09-16 23:40:47 +03:00
Emil Maskovsky	87bd328873	group0: remove obsolete "stop_before_becoming_raft_voter" error injection The Raft topology workflow was changed by the limited voters feature: nodes no longer request votership themselves. As a result, the "stop_before_becoming_raft_voter" error injection is now obsolete and has been removed. Fixes: scylladb/scylladb#23418	2025-09-16 18:24:27 +02:00
Emil Maskovsky	0453052d66	test/random_failures: preserve test repeatability when removing error injections The order of entries in the ERROR_INJECTIONS list determines test repeatability for a given random seed. To allow removing error injections without affecting the order of the remaining ones, removed injections are now renamed with a "REMOVED_" prefix instead of being deleted. This ensures they are ignored by the tests, while the sequence of active injections—and thus test reproducibility—remains unchanged.	2025-09-16 18:22:45 +02:00
Michał Hudobski	3364cc96f5	test: verify that the index metric is added This commit adds a test that performs a sanity check that the implemented metric is actually being added to Scylla's metrics and has the correct value.	2025-09-16 18:10:01 +02:00
Aleksandra Martyniuk	3324f08e9c	tasks: get progress of failed task based on children Currently, for failed tasks task_manager::task::impl::get_progress attempts to find expected_total_workload. However, if the task has finished a long time ago, the state might have totally changed, e.g. some tables might have been dropped or have changed their sizes. Due to that, the result of expected_total_workload might be irrelevant. Count the progress of a finished task based on children only, regardless of whether the task has succeeded or failed.	2025-09-16 17:15:01 +02:00
Aleksandra Martyniuk	17e9ec11d7	db: fix indentation	2025-09-16 14:49:54 +02:00
Aleksandra Martyniuk	0024339a71	db: cache: consider preempting after each partition Currently, during cache invalidation we check if we need to preempt only after the partition gets invalidated. This may lead to stalls if we have a chain of filtered out partitions. Check for preemption even if the partition does not get invalidated. Refs: #9136.	2025-09-16 14:45:28 +02:00
Cezar Moise	e9be1e7b35	test: cleanup big mutation commitlog tests - fix typos - improve comments - remove false and misleading comments - remove `disableautocompaction` as it did nothing for the test and the comment with it was false	2025-09-16 15:33:23 +03:00
Cezar Moise	492b4cf71c	test: fix test_one_big_mutation_corrupted_on_startup The commitlog in the tests with big mutations were corrupted by overwriting 10 chunks of 1KB with random data, which could not be enough due to randomness and the big size of the commitlog (~65MB). Change `corrupt_file` to overwrite based on a percentage of the file's size instead of fixed number of chunks. refs: #25627	2025-09-16 15:32:44 +03:00
Nikos Dragazis	58e8142a06	utils: azure: Catch system errors when detecting IMDS When the default credential provider probes IMDS to check its availability, it assumes that application-level connection timeouts are the only error that can occur when the node is not an Azure VM, i.e., the packets will be silently dropped somewhere in the network. However, this has proven not always true for the `test_azure_provider_with_incomplete_creds` unit test, which overrides the default IMDS endpoint with a non-routeable IP from TEST-NET-1 [1]. This test has been reported to fail in some local setups where routers respond with ICMP "host unreachable" errors instead of silently dropping the packets. This error propagates to user space as an EHOSTUNREACH system error, which is not caught by the default credential provider, causing the test to fail. The reason we use a non-routeable address in this test is to ensure that IMDS probing will always fail, even if running the test on an Azure VM. Theoretically, the same problem applies to the default IMDS endpoint as well (169.254.169.254). The RFC 3927 [2] mandates that packets targeting link-local addresses (169.254/16) must not be forwarded, but the exact behavior is left to implementation. Since we cannot predict how routers will behave, fix this by catching all relevant system errors when probing IMDS. [1] https://datatracker.ietf.org/doc/html/rfc5735 [2] https://datatracker.ietf.org/doc/html/rfc3927 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-16 15:27:59 +03:00
Nikos Dragazis	78bcecd570	utils: azure: Bump default credential logs from DEBUG to INFO The default credential provider produces diagnostic logs on each step as it walks through the credential chain. These logs are useful for operators to diagnose authentication problems as they expose information about which credential sources are being evaluated, in which order, why they fail, and which source is eventually selected. Promote them from DEBUG to INFO level. Additionally, concatenate the logs for environment credentials into a single log statement to avoid interleaving with other logs. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-16 15:20:52 +03:00
Michał Hudobski	b09d1f0a98	index, metrics: add per-index metrics This patch adds the possibility to track metrics per secondary index. Currently, only a histogram of query latencies is tracked, but more metrics can be added in the future. To add a new metric, it needs to be added to the index_metrics struct in index/secondary_index_manager.hh and then initialized in index/secondary_index_manager.cc in the constructor of the index_metrics struct. The metrics are created when the index is created and removed when the index is dropped. First lines of the new metric: # HELP scylla_index_query_latencies Index query latencies # TYPE scylla_index_query_latencies histogram scylla_index_query_latencies_sum{idx="test_i_idx",ks="test"} 640 scylla_index_query_latencies_count{idx="test_i_idx",ks="test"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="640.000000"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="768.000000"} 1	2025-09-16 14:03:43 +02:00
Lakshmi Narayanan Sreethar	7cdda510ee	compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-16 15:29:57 +05:30
Abhinav Jha	00327bdfd0	storage_service: remove assumptions and checks for ignore_nodes to be normal. The ignore nodes param should not be required to contain only normal nodes. This commit removes such assumptions and checks. However, the check that ignored nodes are present in the topology remains, and an error is thrown if an unrelated node is added to ignore_nodes.	2025-09-16 15:00:11 +05:30
Patryk Jędrzejczak	9efe250a8f	Merge 'gossiper: ensure gossiper operations are executed in gossiper scheduling group' from Sergey Zolotukhin Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures. This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group. Fixes scylladb/scylladb#25907 Refs: scylladb/scylladb#25702 Backport: this patch fixes an issue with gossiper operations scheduling group, that might affect topology operations, therefore backport is needed to 2025.1, 2025.2, 2025.3 Closes scylladb/scylladb#25981 * https://github.com/scylladb/scylladb: gossiper: ensure gossiper operations are executed in gossiper scheduling group gossiper: fix wrong gossiper instance used in `force_remove_endpoint`	2025-09-16 10:14:15 +02:00
Asias He	54162a026f	scylla-nodetool: Add --incremental-mode option to cluster repair The `--incremental-mode` option specifies the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular. Fixes #25931 Closes scylladb/scylladb#25969	2025-09-16 10:23:22 +03:00
Botond Dénes	ee7c85919e	Revert "treewide: seastar module update and fix broken rest client" This reverts commit `44d34663bc` of PR https://github.com/scylladb/scylladb/pull/25915. Breaks artifact tests on ARM, blocking us from building new images from master.	2025-09-16 08:31:08 +03:00
Lakshmi Narayanan Sreethar	84f2e99c05	compaction/scrub: handle exceptions when moving invalid sstables to quarantine In validate mode, scrub moves invalid sstables into the quarantine folder. If validation fails because the sstable files are missing from disk, there is nothing to move, and the quarantine step will throw an exception. Handle such exceptions so scrub can return a proper compaction_result instead of propagating the exception to the caller. This will help the testcase for #23363 to reliably determine if the scrub has failed or not. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-15 23:10:23 +05:30
Sergey Zolotukhin	6c2a145f6c	gossiper: ensure gossiper operations are executed in gossiper scheduling group Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures. This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group. Fixes scylladb/scylladb#25907	2025-09-15 12:49:07 +02:00
Petr Gusev	1d270020f2	test_tablets_lwt: eliminate redundant disable_tablet_balancing This is a refactoring commit.	2025-09-15 12:40:10 +02:00
Petr Gusev	7060265d5f	test_fencing: add test_lwt_fencing_upgrade This test verifies that upgrading to a Scylla version with LWT fencing does not disrupt existing LWT workloads.	2025-09-15 12:34:45 +02:00
Petr Gusev	49b036cf2b	pylib: extract upgrade helpers from test_sstable_compression_dictionaries_upgrade.py We want to reuse them to test upgrade for LWT fencing	2025-09-15 12:34:45 +02:00
Petr Gusev	82f0235e4b	test_fencing: add test_fenced_out_on_tablet_migration_while_handling_paxos_verb This test verifies that the fencing token is checked on replicas after the local Paxos state is updated. This ensures that if we failed to drain an LWT request during topology changes the replicas where paxos verbs got stuck won't contribute to the target CLs.	2025-09-15 12:34:45 +02:00
Petr Gusev	0156850605	test_fencing: test_fence_lwt_during_bootstap	2025-09-15 12:09:08 +02:00
Dawid Mędrek	18cb748268	docs/snitch: Document default DC and rack The existing article is already extensive and covers pretty much all of the details useful to the user. However, the document lacked minute information like the default names of the DC and rack in case of SimpleSnitch or it didn't explicitly specify the behavior of RackInferringSnitch (though arguably the existing example was more than sufficient). Fixes scylladb/scylladb#23528 Closes scylladb/scylladb#25700	2025-09-15 11:47:22 +02:00
Petr Gusev	92b165b8c0	pylib/rest_client.py: encode injection name Sometimes it's convenient to use slashes in injection names, for example my_component/my_method/my_condition. Without quote() we get 'handler not found' error from Scylla.	2025-09-15 11:24:53 +02:00
Petr Gusev	819d59eeba	storage_proxy_stats: add fenced_out_requests metric We have to drop const qualifiers because now check_fence needs to mutate this metric.	2025-09-15 11:24:53 +02:00
Petr Gusev	6d7af84fed	storage_proxy: add fencing to Paxos verbs This commit adds fencing support to all Paxos verbs: * Pass an optional (for backward compatibility) fencing_token as a parameter to the prepare, accept, learn, and prune verbs. * Call apply_fence twice — before and after accessing local data. This ensures that if the coordinator is fenced out mid-request, the replica does not return success, which would otherwise incorrectly contribute to achieving the target CL. Without this, a user might observe successful writes that become unreadable after the topology operation completes. * For prune, call apply_fence only once because it does not return a response to the LWT coordinator. Fixes scylladb/scylladb#22332	2025-09-15 11:24:53 +02:00
Petr Gusev	ab750af711	storage_proxy::apply_fence: add overload that throws on failure This new apply_fence overload checks the fence and reports a failure by throwing a regular exception.	2025-09-15 11:24:53 +02:00
Petr Gusev	a2bde28efe	storage_proxy: extract apply_fence_result This commit refactors a repeated pattern that applies the fence and embeds the exception into the exception_variant class by extracting it into a separate method.	2025-09-15 11:24:53 +02:00
Petr Gusev	bdfea2fa4c	sp::apply_fence: rename to apply_fence_on_ready This overload performs the fence check only when the future is ready. In this commit, we give it a more descriptive name to better reflect its behavior. Additionally, we add extensive comments explaining the overall fencing scheme and the motivation behind this specific overload.	2025-09-15 11:24:53 +02:00
Petr Gusev	4a5c856d44	sp::apply_fence: rename to check_fence We plan to introduce several additional apply_fence overloads in upcoming commits. To avoid ambiguity, this change renames the existing base function to check_fence.	2025-09-15 10:56:20 +02:00
Petr Gusev	7fb5b2006b	sp::apply_fence: make non-generic It's simpler and more consistent to always use locator::host_id for caller_address. We also slightly reformulate the comment for sp.apply_fence here.	2025-09-15 10:56:20 +02:00
Michał Jadwiszczak	dc1ffd2c10	service/storage_service: drain `view_building_worker` earlier Similarly to view builder, view building worker needs to be drained in `storage_service::do_drain()`. Storage service drain is happening at the very beginning of shutdown procedure. Before this patch, the worker was still building views after the storage service was drained and this caused errors like: `Error applying view update to (named_gate_closed_exception)` and `locator::no_such_tablet_map`. Fixes scylladb/scylladb#25908 Closes scylladb/scylladb#25984	2025-09-15 11:29:19 +03:00
Gleb Natapov	d3badf7406	storage_service: change node_ops_info::ignore_nodes to host id It drops the useless translation from id to ip during removenode through topology coordinator. Closes scylladb/scylladb#25958	2025-09-15 10:18:24 +02:00
Sergey Zolotukhin	340413e797	gossiper: fix wrong gossiper instance used in `force_remove_endpoint` `gossiper::force_remove_endpoint` is always executed on shard 0 using `invoke_on`. Since each shard has its own `gossiper` instance, if `force_remove_endpoint` is called from a shard other than shard 0, `my_host_id()` may be invoked on the wrong `gossiper` object. This results in undefined behavior due to unsynchronized access to resources on another shard.	2025-09-15 08:54:59 +02:00
Aleksandra Martyniuk	55fde70f8d	api: tasks: task_manager: keep children identities in chunked_{array,vector} task_status contains a vector of children identities. If the number of children is large, we may hit oversized allocation. Change all types of children-related containers to chunked_vector. Modify the children type returned from task manager API. Fixes: scylladb#25795. Closes scylladb/scylladb#25923	2025-09-15 08:44:16 +03:00
Nadav Har'El	b4e3d4ac2f	alternator: nicer error message for integer overflow in list index In the DynamoDB API, when "a" is a list attribute, a[999] returns the 1000th element. But if the list isn't that long (e.g., it only has 5 elements), a[999] returns nothing - it's not an error. But it turns out that when the index is so long that it can't even be parsed as an integer, e.g., 99999999999999, DynamoDB does report an error: Invalid ProjectionExpression: List index is not within the allowable range; index: [99999999999999] Before this patch, Alternator also returned an error in this case, with the right type (ValidationException), but with a strange low-level error text: Failed parsing ProjectionExpression 'a[99999999999999]': std::out_of_range (stoi) The problem was that the code (in alternator/expressions.g) ran stoi() without converting its std::out_of_range exception to a better user-facing message. We do this in this patch, and the error message now looks like: Failed parsing ProjectionExpression 'a[99999999999999]': list index out of integer range This patch also includes a test reproducing this error, which passes on DynamoDB and on Alternator it fails before this patch and passes with the patch. Fixes #25947 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25951	2025-09-15 08:43:00 +03:00
Nadav Har'El	208d3986a7	alternator: add explanation of internal tags Alternator needs to store a few pieces of information for each table that it can't store in the existing CQL schema. We decided to store this information in hidden tags - tags named with the prefix "system:" - and we already have four of those: Provisioned RCU and WCU, table creation time, and TTL's expiration-time attribute. This patch moves the definition of all four tags to one place in executor.cc, adds a short comment about the content of each tag, and adds a longer comment explaining why we have these hidden tags at all. It is expected that more hidden tags will follow - e.g., to solve issue #5320. So we expect more tags to be added later in the same place in the code. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25980	2025-09-15 08:41:39 +03:00
Emil Maskovsky	99db980899	gossiper: eliminate duplicate code in do_shadow_round Remove a redundant code block inadvertently introduced in commit `4b3d160f34`. While the duplicate did not affect functionality, its presence could cause confusion and maintenance issues. This change does not alter behavior and is purely a cleanup. Fixes: scylladb/scylladb#25999 Backport: The issue exists in all 2025 branches, so it should be backported accordingly. Closes scylladb/scylladb#26001	2025-09-15 08:35:04 +03:00
Jenkins Promoter	c63b335819	Update pgo profiles - aarch64	2025-09-15 05:17:07 +03:00
Jenkins Promoter	e97a0c8b42	Update pgo profiles - x86_64	2025-09-14 21:23:37 -04:00
Aleksandra Martyniuk	75b772adfb	db: optimize cache invalidation following repair/streaming Currently, if a new sstable is created during repair/streaming, we invalidate its whole token range in cache. If the sstable is sparse, we unnecessarily clear too much data. Modify cache invalidation, so that only the partitions present in the sstable are cleared. To check whether a partition is present in the sstable, we use bloom filters. Bloom filters may return false positives and show that an sstable contains a partition, even though it does not. Due to that we may invalidate a bit more than we need to, but the cache will be in valid state. An issue arises when we do not invalidate two consecutive partitions that are continuous. The sstable may contain a token that falls between these partitions, breaking the continuity. To check that, we would need to scan sstable index. However, such a change would noticeably complicate the invalidation, both performance and code. In this change, sstable index reader isn't used. Instead, the continuity flag is unset for all scanned partitions. This comes at a cost of heavier reads, as we will need to verify continuity when reading more than one partition from cache. Fixes: https://github.com/scylladb/scylladb/issues/9136. Closes scylladb/scylladb#25996	2025-09-14 19:48:14 +03:00
Lakshmi Narayanan Sreethar	1d1e572962	sstables: skip bloom filter rebuilds with minimal savings If a bloom filter was built with a bad partition estimate, it is rebuilt right before the sstable is sealed. The rebuild is already skipped if the current bitset size results in a false-positive rate within 75%–125% of the configured value. This patch adds additional conditions to prevent rebuilds when the savings are minimal. It also skips rebuilding for garbage collected sstables, since they will be dropped soon anyway. Also updated and added more test cases to cover these new criteria for bloom filter rebuilds. Fixes #25464 Fixes #25468 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25968	2025-09-14 18:19:50 +03:00
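Note on 1d1e572962: the existing "within 75% to 125% of the configured value" check mentioned in this message can be sketched as below. This is only an illustration under stated assumptions: it uses the textbook bloom filter estimate p = (1 - e^(-k*n/m))^k, which may differ from how Scylla computes the rate, and both function names are hypothetical. The additional "minimal savings" conditions added by the patch are not modeled.

```cpp
#include <cmath>
#include <cstdint>

// Textbook false-positive estimate for a bloom filter with `bits` bits,
// `hashes` hash functions and `partitions` inserted keys.
// Assumption: Scylla's own estimate may be computed differently.
double estimated_fp_rate(std::uint64_t bits, unsigned hashes, std::uint64_t partitions) {
    const double k = hashes;
    const double fill = 1.0 - std::exp(-k * double(partitions) / double(bits));
    return std::pow(fill, k);
}

// Hypothetical check: true when the actual rate is close enough to the
// configured one that a rebuild would be skipped.
bool within_tolerance(double actual, double configured) {
    return actual >= 0.75 * configured && actual <= 1.25 * configured;
}
```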
Nadav Har'El	5307d1b9a8	Merge 'vector_index: add version to index options' from Dawid Pawlik Since creating the vector index does not lead to creation of a view table [#24438] (whose version info had been logged in `system_schema.scylla_tables`) we lacked the information about the version of the index. The solution we arrived at is to add the version as a field in options column of `system_schema.indexes`. It requires few changes and seems unintrusive for existing infrastructure. This patch implements the solution described above. Refs: VECTOR-142 Closes scylladb/scylladb#25614 * github.com:scylladb/scylladb: cqlpy/test_vector_index: add vector index version test vector_index, index_prop_defs: add version to index options create_index_statement: rename `validator` to `custom_index_factory` custom index: rename `custom_index_option_name` vector_index: rename `supported_options` to `vector_index_options`	2025-09-14 15:35:53 +03:00
Radosław Cybulski	30306c3375	Remove const & from tags_extension constructor `tags_extension` constructor unnecessarily takes `std::map` by const ref, forcing a copy. This patch removes const ref for performance reasons. Closes scylladb/scylladb#25977	2025-09-14 13:32:21 +03:00
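Note on 30306c3375: the general idiom behind this change is to take the container by value and move it into the member. The sketch below uses a hypothetical `tags_holder` class as a stand-in; it is not the real `tags_extension`, whose members and interface are not shown in this log.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for tags_extension; only the constructor shape matters.
class tags_holder {
    std::map<std::string, std::string> _tags;
public:
    // By-value parameter: an rvalue argument is moved into `tags` and then
    // moved into the member, so that path makes no copy of the map.
    // With a const& parameter, initializing the member always copies.
    explicit tags_holder(std::map<std::string, std::string> tags)
        : _tags(std::move(tags)) {}

    const std::map<std::string, std::string>& tags() const { return _tags; }
};
```

Callers that pass an lvalue still pay for one copy; callers that pass a temporary or use `std::move` pay only for moves.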
Ernest Zaslavsky	44d34663bc	treewide: seastar module update and fix broken rest client start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client Seastar module update ``` b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El 74472298 http: generalize request's Content-Type setting 9fd5a1cc http: generalize reply's Content-Type setting a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure() d2a5a8a9 resource.cc: Remove some dead code 7ad9f424 http: Add support of multiple key repetitions for the request a636baca task: Move task::get_backtrace() definition in its class a0101efa Fixed "doxygen" spelling in error message db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes 5357b434 http/reply: introduce set_cookie() 1ddcf05f http/reply: make write_reply*() public 4b782d73 http/connection: start_response(): fix indentation 720feca0 http/reply: encapsulate reply writing in write_reply() 3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov dbb2bf3f test: Add test for input_stream::read_exactly() a5308ec9 file/directory_lister: Correctly wrap up fallback generator 4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer 59801da7 tests: Add directory lister early drop cases 33233032 http/reply: s/write_reply_to_connection/write_reply/ 69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream 56e9bda7 test: Convert directory_test into seastar test 96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov 8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream 3370e22a tutorial.md: use current_exception_as_future() e977453a Add fixture support for seastar::testing 3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally 2a4ae7b4 io_tester: Count 
file size overflows 5e678bb5 io_tester: Tuneup size overflow check d5dad8ce io_tester: Move position management code to io_class_data 5586a056 io_tester: Rename seqwrite -> overwrite 92df2fb2 io_tester: Relax return value of create_and_fill_file() 03d9500d io_tester: Dont fill file for APPEND d6844a7b io_tester: Indentation fix after previous patch fb9e0088 io_tester: Coroutinize create_and_fill_file() 2f802f57 exceptions: log thrown and propagated exception with distinct log levels 4971fa70 util: move log-level into own header 39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov 52d0c4fb iostream: Move output_stream::write(scattered_message) lower 7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky d0881b7e read_first_line: Add missing license boilerplate 988a0e99 read_first_line:: Add missing `#pragma once` 42675266 http: Make client::make_request accept const request& c7709fb5 http: Make request making API return exceptional future not throw b68ed89b http: Move request content length header setup 1d96dac6 http: Move request version configuration 072e86f6 http: Setup request once ``` Closes scylladb/scylladb#25915	2025-09-13 17:14:28 +03:00
Asias He	9bca90be0d	repair: Fix repair_row_level_stop verb idl The version keyword is missing for the optional mark_as_repaired parameter. This causes the new node to expect more data to come: INFO 2025-09-01 19:23:05,332 [shard 0:strm] rpc - client 127.0.7.6:50116 msg_id 8: caught exception while processing a message: std::out_of_range (deserialization buffer underflow) When the sender is an old node in a mixed cluster, the data will never come. To fix, add the missing version keyword. Our idl-compiler.py should have caught the typo since the keyword was missing in the [[]] tag. Fixes #25666 Closes scylladb/scylladb#25782	2025-09-12 15:58:19 +03:00
Avi Kivity	ef7babda3d	Merge 'test: deflake test_restart_leaving_replica_during_cleanup' from Patryk Jędrzejczak The test started hitting #21779 recently. We deflake it in this commit by disabling the tablet load balancing before dropping the keyspace at the end of the test. We still have to understand why the test started hitting #21779, so we keep #25938 open. Refs #25938 The test was flaky only on master, so no backport needed. Closes scylladb/scylladb#25975 * github.com:scylladb/scylladb: test: enable load balancing on a single node in test_restart_leaving_replica_during_cleanup test: deflake test_restart_leaving_replica_during_cleanup	2025-09-12 15:58:19 +03:00
Sayanta Banerjee	6092520631	Small grammatical changes Closes scylladb/scylladb#24667	2025-09-12 15:58:19 +03:00
Radosław Cybulski	436150eb52	treewide: fix spelling errors Fix spelling errors reported by copilot on github. Remove single use namespace alias. Closes scylladb/scylladb#25960	2025-09-12 15:58:19 +03:00
Patryk Jędrzejczak	aaab71c14e	test: enable load balancing on a single node in test_restart_leaving_replica_during_cleanup Doing it on more than one node is redundant.	2025-09-11 13:19:56 +02:00
Patryk Jędrzejczak	4c9efc08d8	test: deflake test_restart_leaving_replica_during_cleanup The test started hitting #21779 recently. We deflake it in this commit by disabling the tablet load balancing before dropping the keyspace at the end of the test. We still have to understand why the test started hitting #21779, so we keep #25938 open. Refs #25938	2025-09-11 13:19:51 +02:00
Patryk Jędrzejczak	eae12c1717	test: cluster: add a test for restarts with no group 0 quorum We don't have such a test, and we could add a group 0 quorum requirement on the restart path by mistake. A new test, no backport. Closes scylladb/scylladb#25623	2025-09-11 08:56:34 +03:00
Raphael S. Carvalho	b607b1c284	compaction: Fix stop of sstable cleanup The interface suggests the whole sstable cleanup is aborted with 'nodetool stop CLEANUP', but it is currently stopping only the ongoing cleanup task, and the compaction manager will retry the task since the error is not propagated all the way back to the caller. With raft topology, the coordinator should retry it though since cleanup became mandatory with automatic cleanup. So it's only fixing the usage where cleanup is issued manually. The stop exception is only propagated to the caller of cleanup. When stopping tasks during shutdown, the exception is swallowed and the error only returned to the caller. Fixes #20823. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24996	2025-09-11 08:55:10 +03:00
Yaron Kaikov	902d139c80	tools: toolchain: dbuild: add setuptools_scm as dependency this package was added as a dependency to `cqlsh` in `216d8b0658` Fixes: https://github.com/scylladb/scylladb/issues/25613 [Yaron: regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes scylladb/scylladb#25932	2025-09-11 08:51:28 +03:00
Cezar Moise	20ba8d4e8c	test: skip flaky test test_one_big_mutation_corrupted_on_startup The test is flaky since it tries to corrupt the commitlog in a non-deterministic way that sometimes allows the tested mutation to escape and be replayed anyhow. refs: #25627 Closes scylladb/scylladb#25950	2025-09-11 08:39:24 +03:00
Avi Kivity	c91b326d5a	Merge 'transport: replace throwing protocol_exception with returns' from Dario Mirovic Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance. Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers. The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same. transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved. cql3 module changes do the same as transport server module. Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query. 
Command line used: ``` ./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null ``` The only thing changed across runs is `--workload write`/`--workload read`. Built and run on `release` target. <details> ``` throughput: mean= 36946.04 standard-deviation=1831.28 median= 37515.49 median-absolute-deviation=1544.52 maximum=39748.41 minimum=28443.36 instructions_per_op: mean= 108105.70 standard-deviation=965.19 median= 108052.56 median-absolute-deviation=53.47 maximum=124735.92 minimum=107899.00 cpu_cycles_per_op: mean= 70065.73 standard-deviation=2328.50 median= 69755.89 median-absolute-deviation=1250.85 maximum=92631.48 minimum=66479.36 ⏱ real=5:11.08 user=2:00.20 sys=2:25.55 cpu=85% ``` ``` throughput: mean= 40718.30 standard-deviation=2237.16 median= 41194.39 median-absolute-deviation=1723.72 maximum=43974.56 minimum=34738.16 instructions_per_op: mean= 117083.62 standard-deviation=40.74 median= 117087.54 median-absolute-deviation=31.95 maximum=117215.34 minimum=116874.30 cpu_cycles_per_op: mean= 58777.43 standard-deviation=1225.70 median= 58724.65 median-absolute-deviation=776.03 maximum=64740.54 minimum=55922.58 ⏱ real=5:12.37 user=27.461 sys=3:54.53 cpu=83% ``` ``` throughput: mean= 37107.91 standard-deviation=1698.58 median= 37185.53 median-absolute-deviation=1300.99 maximum=40459.85 minimum=29224.83 instructions_per_op: mean= 108345.12 standard-deviation=931.33 median= 108289.82 median-absolute-deviation=55.97 maximum=124394.65 minimum=108188.37 cpu_cycles_per_op: mean= 70333.79 standard-deviation=2247.71 median= 69985.47 median-absolute-deviation=1212.65 maximum=92219.10 minimum=65881.72 ⏱ real=5:10.98 user=2:40.01 sys=1:45.84 cpu=85% ``` ``` throughput: mean= 38353.12 standard-deviation=1806.46 median= 38971.17 median-absolute-deviation=1365.79 maximum=41143.64 minimum=32967.57 instructions_per_op: mean= 117270.60 
standard-deviation=35.50 median= 117268.07 median-absolute-deviation=16.81 maximum=117475.89 minimum=117073.74 cpu_cycles_per_op: mean= 57256.00 standard-deviation=1039.17 median= 57341.93 median-absolute-deviation=634.50 maximum=61993.62 minimum=54670.77 ⏱ real=5:12.82 user=4:10.79 sys=11.530 cpu=83% ``` This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes. Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second 300 seconds = 11.4m ops. Update: I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage. run count: 5 times before and 5 times after the patch duration: 300 seconds Average write throughput median before patch: 41155.99 Average write throughput median after patch: 42193.22 Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350. </details> Built and run on `release` target. 
<details> ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null ``` throughput: mean= 14910.90 standard-deviation=477.72 median= 14956.73 median-absolute-deviation=294.16 maximum=16061.18 minimum=13198.68 instructions_per_op: mean= 659591.63 standard-deviation=495.85 median= 659595.46 median-absolute-deviation=324.91 maximum=661184.94 minimum=658001.49 cpu_cycles_per_op: mean= 213301.49 standard-deviation=2724.27 median= 212768.64 median-absolute-deviation=1403.85 maximum=225837.15 minimum=208110.12 ⏱ real=5:19.26 user=5:00.22 sys=15.827 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null ``` throughput: mean= 93345.45 standard-deviation=4499.00 median= 93915.52 median-absolute-deviation=2764.41 maximum=104343.64 minimum=79816.66 instructions_per_op: mean= 65556.11 standard-deviation=97.42 median= 65545.11 median-absolute-deviation=71.51 maximum=65806.75 minimum=65346.25 cpu_cycles_per_op: mean= 34160.75 standard-deviation=803.02 median= 33927.16 median-absolute-deviation=453.08 maximum=39285.19 minimum=32547.13 ⏱ real=5:03.23 user=4:29.46 sys=29.255 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null ``` throughput: mean= 206982.18 standard-deviation=15894.64 median= 208893.79 median-absolute-deviation=9923.41 maximum=232630.14 minimum=127393.34 instructions_per_op: mean= 35983.27 standard-deviation=6.12 median= 35982.75 median-absolute-deviation=3.75 maximum=36008.24 minimum=35952.14 cpu_cycles_per_op: mean= 17374.87 standard-deviation=985.06 median= 17140.81 median-absolute-deviation=368.86 maximum=26125.38 minimum=16421.99 ⏱ real=5:01.23 user=4:57.88 sys=0.124 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null ``` throughput: mean= 16198.26 
standard-deviation=902.41 median= 16094.02 median-absolute-deviation=588.58 maximum=17890.10 minimum=13458.74 instructions_per_op: mean= 659752.73 standard-deviation=488.08 median= 659789.16 median-absolute-deviation=334.35 maximum=660881.69 minimum=658460.82 cpu_cycles_per_op: mean= 216070.70 standard-deviation=3491.26 median= 215320.37 median-absolute-deviation=1678.06 maximum=232396.48 minimum=209839.86 ⏱ real=5:17.33 user=4:55.87 sys=18.425 cpu=99% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null ``` throughput: mean= 97067.79 standard-deviation=2637.79 median= 97058.93 median-absolute-deviation=1477.30 maximum=106338.97 minimum=87457.60 instructions_per_op: mean= 65695.66 standard-deviation=58.43 median= 65695.93 median-absolute-deviation=37.67 maximum=65947.76 minimum=65547.05 cpu_cycles_per_op: mean= 34300.20 standard-deviation=704.66 median= 34143.92 median-absolute-deviation=321.72 maximum=38203.68 minimum=33427.46 ⏱ real=5:03.22 user=4:31.56 sys=29.164 cpu=99% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null ``` throughput: mean= 223495.91 standard-deviation=6134.95 median= 224825.90 median-absolute-deviation=3302.09 maximum=234859.90 minimum=193209.69 instructions_per_op: mean= 35981.41 standard-deviation=3.16 median= 35981.13 median-absolute-deviation=2.12 maximum=35991.46 minimum=35972.55 cpu_cycles_per_op: mean= 17482.26 standard-deviation=281.82 median= 17424.08 median-absolute-deviation=143.91 maximum=19120.68 minimum=16937.43 ⏱ real=5:01.23 user=4:58.54 sys=0.136 cpu=99% ``` </details> Fixes: #24567 This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported. 
Closes scylladb/scylladb#25408 * github.com:scylladb/scylladb: test/cqlpy: add protocol exception tests test/cqlpy: `test_protocol_exceptions.py` refactor message frame building test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code transport: replace `make_frame` throw with return result cql3: remove throwing `protocol_exception` transport: replace throw in validate_utf8 with result_with_exception_ptr return transport: replace throwing protocol_exception with returns utils: add result_with_exception_ptr test/cqlpy: add unknown compression algorithm test case	2025-09-10 21:54:15 +03:00
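Note on c91b326d5a: the "return a failed result instead of throwing" pattern described in this message can be sketched with standard library types only. Everything named here is hypothetical: `result_or_eptr`, `read_byte` and `value_or_throw` are illustration names, and the sketch does not reproduce `utils::result_with_eptr` or the `fragmented_temporary_buffer` API.

```cpp
#include <cstdint>
#include <exception>
#include <stdexcept>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical result type: either a value or a captured exception.
template <typename T>
using result_or_eptr = std::variant<T, std::exception_ptr>;

// Reads one byte from the buffer. On underflow it returns a failed
// result to the caller instead of throwing.
result_or_eptr<std::uint8_t> read_byte(std::string_view& buf) {
    if (buf.empty()) {
        return std::make_exception_ptr(std::runtime_error("truncated frame"));
    }
    auto b = static_cast<std::uint8_t>(buf.front());
    buf.remove_prefix(1);
    return b;
}

// For callers that want to keep the old throwing behavior: unwrap the
// value, or rethrow the stored exception.
template <typename T>
T value_or_throw(result_or_eptr<T> r) {
    if (auto* e = std::get_if<std::exception_ptr>(&r)) {
        std::rethrow_exception(*e);
    }
    return std::get<T>(std::move(r));
}
```

In this shape, code on the hot path inspects the result and propagates the error as a value, while a caller such as the commitlog path described above can keep its throwing behavior through a helper like `value_or_throw`.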
Yaron Kaikov	0a025d121f	packaging: Add `adduser` as dependency The `adduser` command is used by `/var/lib/dpkg/info/scylla-server.postinst` and similar during rpm post-install. Fixes: https://github.com/scylladb/scylladb/issues/23722 Closes scylladb/scylladb#25928	2025-09-10 21:51:25 +03:00
Avi Kivity	fc64333040	Merge 'sstables/trie: add BTI index readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25506/ Next part: plugging the BTI index readers and writers into sstable readers and writers. The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability. This series implements, on top of the key translation logic and the abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md`. Caveats: 1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change. 2. These readers and writers are intended to be compatible with Cassandra, but I did NOT do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers. 3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now. No backports needed, new functionality. 
Closes scylladb/scylladb#25626 * github.com:scylladb/scylladb: test/manual: add bti_cassandra_compatibility_test test/lib/random_schema: add some constraints for generated uuid and time/date values test/lib/random_utils: add a variant of get_bytes which takes an `engine&` test/boost: add bti_index_test sstables/writer: add an accessor for the current write position in Data.db sstables/trie: introduce bti_index_reader sstables/trie: add bti_partition_index_writer.cc sstables/trie: add bti_row_index_writer.cc utils/bit_cast: add a new overload of write_unaligned() sstables/trie: add trie_writer::add_partial() sstables/consumer: add read_56() sstables/trie: make bti_node_reader::page_ptr copy-constructible sstables: extract abstract_index_reader from index_reader.hh to its own header sstables/trie: add an accessor to the file_writer under bti_node_sink sstables/types: make `deletion_time::operator tombstone()` const sstables/types: add sstables::deletion_time::make_live() sstables/trie: fix a special case in max_offset_from_child sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding sstables/trie: rewrite lcb_mismatch to handle fragment invalidation test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`	2025-09-10 21:48:52 +03:00
Pavel Emelyanov	88a01308e7	api: Move /storage_service/keyspaces handler to database module The handler uses database service, not storage_service, and should belong to the corresponding API module from column_family.cc Once moved, the handler can use captured sharded<database> reference and forget about http_context::db. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25834	2025-09-10 17:01:11 +02:00
Nadav Har'El	ce4592d8fc	Merge 'test: cluster: deflake consistency checks after decommission' from Patryk Jędrzejczak In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). Therefore, `check_token_ring_and_group0_consistency` called just after decommission might fail when the decommissioned node is still in group 0 (as a non-voter). We deflake all tests that call `check_token_ring_and_group0_consistency` after decommission in this PR. Fixes #25809 This PR improves CI stability and changes only tests, so it should be backported to all supported branches. Closes scylladb/scylladb#25927 * github.com:scylladb/scylladb: test: cluster: deflake consistency checks after decommission test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency	2025-09-10 17:57:02 +03:00
Dawid Pawlik	1ce76a6ca2	cqlpy/test_vector_index: add vector index version test Test if the index version is the same as the base table version before the index was created. Test if recreating the index with the same parameters changes the version. Test if altering the base table does not change the version. Test if the user cannot specify the index version option by themselves.	2025-09-10 15:19:36 +02:00
Dawid Pawlik	909a51e524	vector_index, index_prop_defs: add version to index options Since creating the vector index does not lead to creation of a view table [#24438] (whose version info had been logged in `system_schema.scylla_tables`) we lack the information about the version of the index. The mentioned version is used to recognize a quick drop-and-create of an index with the same parameters, which needs to be rebuilt. The case is mainly experienced while testing, benchmarking or experimenting with Vector Search. Nevertheless it is important to have it considered, as it is really weird to see that DROP and CREATE commands did not change anything. Although reusing the same old index is a nice "optimization", the rebuild feels more natural for users getting to know Vector Search. Should not change anything in a real production environment. The solution we arrived at is to add the version as a field in options column of `system_schema.indexes`. The version of vector index is a base table's schema version on which the index was created. The table's schema version changes every time a table is changed, meaning that a CREATE INDEX or DROP INDEX statement also changes it. Every index has a different index version, so it allows to identify them easily. This patch implements the solution described above.	2025-09-10 15:16:54 +02:00
Michał Chojnowski	47c2d09c22	test/manual: add bti_cassandra_compatibility_test Adds a heavy test which tests compatibility of BTI index files between Cassandra and Scylla. It's composed of a C++ part, used to read and write BTI files with Scylla's readers and writers, and a Python part which uses a Cassandra node and the C++ executable to make them read and write each other's files. The stages of the test are: 1. Use the C++ part to generate a random BIG sstable, and matching BTI index files. 2. Import the BIG files into Cassandra, let it generate its own BTI index files. 3. Read both Scylla's BTI and Cassandra's BTI index files using the C++ part. Check that they return the right positions and tombstones for each partition and row. 4. Sneakily swap Cassandra's BTI files for Scylla's BTI files, and query Cassandra (via CQL) for each row. Check that each query returns the right result. Not much can be inferred about the index via CQL queries, so the check we are doing on Cassandra is relatively weak. But in conjunction with the checks done on the Scylla part, it's probably good enough. The test is weird enough, and with heavy-enough dependencies (it uses a podman container to run the Cassandra) that it has been put in test/manual. To run the test, build `build/$build_mode/test/manual/bti_cassandra_compatibility_test_g`, and run `python test/manual/bti_cassandra_compatibility_test.py`. Note: there are a lot of things that could go wrong in this test. (E.g. file permission issues or port mapping issues due to the container usage, incompatibilities between the Python driver and the random CQL values generated by generate_random_mutations, etc). I hope it works everywhere, but I only tested it on my machine, running it inside the dbuild container.	2025-09-10 13:04:42 +02:00
Botond Dénes	514f59d157	tools/scylla-sstable: write: move to UUID generation We are moving away from integer generations, so stop using them. Also drop the --generation command-line parameter, UUID generations don't have to be provided by the caller, because random UUIDs will not collide with each other. To help the caller still know what generation the output sstable has (previously they provided it via --generation), print the generation to stdout. Closes scylladb/scylladb#25166	2025-09-10 13:47:26 +03:00
Aleksandra Martyniuk	20f55ea1b8	compaction: handle exception in expected_total_workload expected_total_workload methods of scrub compaction tasks create a vector of table_info based on table names. If any table was already dropped, then the exception is thrown. It leaves table_info in corrupted state and node crashes with `free(): invalid size`. Return std::nullopt if an exception was thrown to indicate that total workload cannot be found. Fixes: #25941.	2025-09-10 12:13:37 +02:00
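Note on 20f55ea1b8: the "return std::nullopt when the workload cannot be computed" approach can be sketched as follows. The registry type, the use of `std::out_of_range` for a dropped table, and the free function shown here are all assumptions made for the illustration; the real methods belong to the scrub compaction tasks and work with `table_info`.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical registry: table name -> estimated work units.
using table_sizes = std::unordered_map<std::string, long>;

// Sums the workload of the named tables. If any table is gone, the lookup
// throws; the exception is caught here and reported as "unknown" instead
// of escaping with partially built state.
std::optional<long> expected_total_workload(const table_sizes& tables,
                                            const std::vector<std::string>& names) {
    try {
        long total = 0;
        for (const auto& name : names) {
            total += tables.at(name);  // throws std::out_of_range if missing
        }
        return total;
    } catch (const std::out_of_range&) {
        return std::nullopt;
    }
}
```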
Nadav Har'El	5e7251cd40	secondary index: fix xfailing test to pass on Cassandra We have an xfailing test test_secondary_index.py::test_limit_partition which reproduces a Scylla bug in LIMIT when scanning a secondary index (Refs #22158). The point of such a reproducer is to demonstrate the bug by passing on Cassandra but failing on Scylla - yet this specific test doesn't pass on Cassandra because it expects the wrong 3 out of 4 results to be returned: The test begins with LIMIT 1 and sees the first result is (2,1), so we expect when we increase the LIMIT to 3 to see more results from the same partition (2) - and yet the test mistakenly expected the next results to come from partition 1, which is not a reasonable expectation, and doesn't happen in Cassandra (I checked both Cassandra 5 and 4). After this patch, the test passes on Cassandra (I tried 4 and 5), and continues to fail on Scylla - which returns 4 rows despite the LIMIT 3. Note that it is debatable whether this test should insist at all on which 3 items are returned by "LIMIT 3" - In Cassandra the ordering of a SELECT with a secondary index is not well defined (see discussion in Refs #23392). So an alternative implementation of this test would be to just check that LIMIT 3 returns 3 items without insisting which: # In Cassandra the ordering of a SELECT with a secondary index is not # defined (see discussion in #23392), so we don't know which three # results to expect - just that it must be a 3-item subset. rows = list(rs) assert len(rows) == 3 assert set(rows).issubset({(1,1), (1,2), (2,1), (2,2)}) However, as of yet, I did not modify this test to do this. I still believe there is value in secondary index scans having the same order as a scan without a secondary index has - and not an undefined order, and if both Scylla and Cassandra implement that in practice, it's useful for tests to validate this so we'll know if this guarantee is ever broken. 
Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25676	2025-09-10 08:48:52 +03:00
Wojciech Mitros	1f9be235b8	mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the view. Ghost rows are rows in the view with no corresponding row in the base table. Before this patch, only rows whose primary key columns of the base table had different values than any of the base rows were treated as ghost rows by the PRUNE statement. However, view rows which have a column in their primary key that's not in the base primary key can also be ghost rows if this column has a different value than the base row with the same values of remaining primary key columns. That's because these rows won't be deleted unless we change the value of this column in the base table to this specific value. In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic. If this column isn't the same in the base table and the view, these rows are also deleted. Fixes https://github.com/scylladb/scylladb/issues/25655 Closes scylladb/scylladb#25720	2025-09-10 07:35:00 +02:00
Patryk Jędrzejczak	520cc0eeaa	Merge 'test: fix race condition in test_long_join' from Emil Maskovsky The test could trigger gossiper API calls before the API was properly registered, causing intermittent 404 errors. Previously the test waited for the "init - starting gossiper" log, but this appears before API registration completes. Add explicit wait for gossiper API registration to ensure the endpoint is available before making requests, eliminating test flakiness. Fixes: scylladb/scylladb#25582 No backport needed: Issue only observed in master so far. Closes scylladb/scylladb#25583 * https://github.com/scylladb/scylladb: test: improve async execution in test_long_join test: fix race condition in test_long_join	2025-09-09 19:12:59 +02:00
Patryk Jędrzejczak	bb9fb7848a	test: cluster: deflake consistency checks after decommission In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). Therefore, `check_token_ring_and_group0_consistency` called just after decommission might fail when the decommissioned node is still in group 0 (as a non-voter). We deflake all tests that call `check_token_ring_and_group0_consistency` after decommission in this commit. Fixes #25809	2025-09-09 19:01:12 +02:00
Patryk Jędrzejczak	e41fc841cd	test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't handle such a case; it only handles cases where the token ring is updated later. We fix this in this commit. We rely on the new implementation of `wait_for_token_ring_and_group0_consistency` in the following commit to fix flakiness of some tests. We also update the obsolete docstring in this commit.	2025-09-09 19:01:09 +02:00
Avi Kivity	5237a20993	Merge 'replica: Fix split compaction when tablet boundaries change' from Raphael Raph Carvalho Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. All 2025.* versions are vulnerable, so fix must be backported to them. Closes scylladb/scylladb#25690 * github.com:scylladb/scylladb: replica: Fix split compaction when tablet boundaries change replica: Futurize split_compaction_options()	2025-09-09 17:05:32 +03:00
Dawid Mędrek	789a4a1ce7	test/perf: Adjust tablet_load_balancing.cc to RF-rack-validity We modify the logic to make sure that all of the keyspaces that the test creates are RF-rack-valid. For that, we distribute the nodes across two DCs and as many racks as the provided replication factor. That may have an effect on the load balancing logic, but since this is a performance test and since tablet load balancing is still taking place, it should be acceptable. This commit also finishes work in adjusting perf tests to pass with the `rf_rack_valid_keyspaces` configuration option enabled. The remaining tests either don't attempt to create keyspaces or they already create RF-rack-valid keyspaces. We don't need to explicitly enable the configuration option. It's already enabled by default by `cql_test_config`. The reason why we haven't run into any issue because of that is that performance tests are not part of our CI. Fixes scylladb/scylladb#25127 Closes scylladb/scylladb#25728	2025-09-09 12:46:46 +03:00
Botond Dénes	a89d0a747b	Merge 'test.py: add different levels of verbosity for output' from Andrei Chekun Add another level of verbosity: quiet. Before this it was used as a default one, but it does not provide enough information. These changes should be coupled with the pytest-sugar plugin to get the intended information for each level. Invoke pytest as a module, instead of a separate process, to get access to the terminal and be able to use it interactively. Framework change only, so backporting to 2025.3 Fixes: #25403 Closes scylladb/scylladb#25698 * github.com:scylladb/scylladb: test.py: add additional level of verbosity for output test.py: start pytest as a module instead of subprocess	2025-09-09 11:49:51 +03:00
Asias He	cb7db47ae1	repair: Add incremental_mode option for tablet repair This patch introduces a new `incremental_mode` parameter to the tablet repair REST API, providing more fine-grained control over the incremental repair process. Previously, incremental repair was on and could not be turned off. This change allows users to select from three distinct modes: - `regular`: This is the default mode. It performs a standard incremental repair, processing only unrepaired sstables and skipping those that are already repaired. The repair state (`repaired_at`, `sstables_repaired_at`) is updated. - `full`: This mode forces the repair to process all sstables, including those that have been previously repaired. This is useful when a full data validation is needed without disabling the incremental repair feature. The repair state is updated. - `disabled`: This mode completely disables the incremental repair logic for the current repair operation. It behaves like a classic (pre-incremental) repair, and it does not update any incremental repair state (`repaired_at` in sstables or `sstables_repaired_at` in the system.tablets table). The implementation includes: - Adding the `incremental_mode` parameter to the `/storage_service/repair/tablet` API endpoint. - Updating the internal repair logic to handle the different modes. - Adding a new test case to verify the behavior of each mode. - Updating the API documentation and developer documentation. Fixes #25605 Closes scylladb/scylladb#25693	2025-09-09 06:50:21 +03:00
Avi Kivity	c4ed7dd814	Merge 'gossiper: fix issues in processing gossip status during the startup and when messages are delayed to avoid empty host ids' from Emil Maskovsky Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart. Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. Fixes https://github.com/scylladb/scylladb/issues/25831 Fixes https://github.com/scylladb/scylladb/issues/25803 Fixes https://github.com/scylladb/scylladb/issues/25702 Fixes https://github.com/scylladb/scylladb/issues/25621 Ref https://github.com/scylladb/scylla-enterprise/issues/5613 Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3. Closes scylladb/scylladb#25849 * github.com:scylladb/scylladb: gossiper: fix empty initial local node state gossiper: add test for a race condition in start_gossiping gossiper: check for a race condition in `do_apply_state_locally` test/gossiper: add reproducible test for race condition during node decommission	2025-09-08 20:51:01 +03:00
Andrei Chekun	ea4cd431c9	test.py: add pytest-sugar plugin to the dependencies This plugin allows having better terminal output with progress bar for the tests. Closes scylladb/scylladb#25845 [avi: regenerate frozen toolchain] Closes scylladb/scylladb#25860	2025-09-08 20:50:02 +03:00
Radosław Cybulski	6d150e2d0c	Fix oversized allocation in paxos under pressure Under CPU pressure, the `_locks` structure in paxos might grow and cause oversized allocations and performance drops. We reserve memory ahead of time. Fixes #25559 Closes scylladb/scylladb#25874	2025-09-08 20:49:00 +03:00
Yaron Kaikov	d57741edc2	build_docker.sh: enable debug symbols installation Adding the latest scylla.repo location to our docker container, this will allow installation of the scylla-debuginfo package in case it's needed Fixes: https://github.com/scylladb/scylladb/issues/24271 Closes scylladb/scylladb#25646	2025-09-08 18:39:27 +03:00
Emil Maskovsky	f8c297ca27	test: improve async execution in test_long_join Replace list comprehensions with asyncio.gather() to await the injection API calls in fully concurrent manner.	2025-09-08 17:14:37 +02:00
Emil Maskovsky	a86bd06f08	test: fix race condition in test_long_join The test could trigger gossiper API calls before the API was properly registered, causing intermittent 404 errors. Previously the test waited for the "init - starting gossiper" log, but this appears before API registration completes. Add explicit wait for gossiper API registration to ensure the endpoint is available before making requests, eliminating test flakiness. Fixes: scylladb/scylladb#25582	2025-09-08 17:14:37 +02:00
Dario Mirovic	ef83d6b970	docs: cql: default create keyspace syntax This patch updates the create keyspace statement docs. It explains how the `replication` option in the create keyspace statement is now optional, and behaves the same as if we specified an empty set as follows: `WITH replication = {}`. An example with no `replication` option specified has also been added. Refs #25145	2025-09-08 15:25:30 +02:00
Dario Mirovic	d92ceed19a	test: cqlpy: add test for create keyspace with no options specified This patch introduces one new test case. It tests that a keyspace can be created without specifying replication options. Since other options already had defaults, this test assures a keyspace can be created with no options specified at all, with the following query: `CREATE KEYSPACE ks;` Refs #25145	2025-09-08 15:25:23 +02:00
Pavel Emelyanov	34d1648d21	main: Properly handle zero allocation warning threshold The --help text says about --large-memory-allocation-warning-threshold: "Warn about memory allocations above this size; set to zero to disable." That's half-true: setting the value to zero spams logs with warnings of allocation of any size, as seastar treats zero threshold literally. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25850	2025-09-08 12:41:19 +02:00
Asias He	451e1ec659	streaming: Fix use after move in the tablet_stream_files_handler The files object is moved before the log when stream finishes. We've logged the files when the stream starts. Skip it in the end of streaming. Fixes #25830 Closes scylladb/scylladb#25835	2025-09-08 11:59:52 +02:00
Sergey Zolotukhin	b34d543f30	gossiper: fix empty initial local node state This change removes the addition of an empty state to `_endpoint_state_map`. Instead, a new state is created locally and then published via replicate, avoiding the issue of an empty state existing in `_endpoint_state_map` before the preemption point. Since this resolves the issue tested in `test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed. Fixes: scylladb/scylladb#25831	2025-09-08 11:38:31 +02:00
Sergey Zolotukhin	775642ea23	gossiper: add test for a race condition in start_gossiping This change adds a test for a race condition in `start_gossiping` that can lead to an empty self state sent in `gossip_get_endpoint_states_response`. Test for scylladb/scylladb#25831	2025-09-08 11:38:30 +02:00
Sergey Zolotukhin	f08df7c9d7	gossiper: check for a race condition in `do_apply_state_locally` In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change 1. adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. 2. Removes xfail from the test_gossiper_race test since the issue is now fixed. 3. Adds exception handling in `do_shadow_round` to skip responses from nodes that sent an empty host ID. This re-applies the commit `13392a40d4` that was reverted in `46aa59fe49`, after fixing the issues that caused the CI to fail. Fixes: scylladb/scylladb#25702 Fixes: scylladb/scylladb#25621 Ref: scylladb/scylla-enterprise#5613	2025-09-08 11:38:30 +02:00
Emil Maskovsky	28e0f42a83	test/gossiper: add reproducible test for race condition during node decommission This change introduces a targeted test that simulates the gossiper race condition observed during node decommissioning. The test delays gossip state application and host ID lookup to reliably reproduce the scenario where `gossiper::get_host_id()` is called on a removed endpoint, potentially triggering an abort in `apply_new_states`. There is a specific error injection added to widen the race window, in order to increase the likelihood of hitting the race condition. The error injection is designed to delay the application of gossip state updates, for the specific node that is being decommissioned. This should then result in the server abort in the gossiper. This re-applies the commit `5dac4b38fb` that was reverted in `dc44fca67c`, but modified to relax the check from "on_internal_error" to just a warning log. The stricter check can be re-introduced later once we are sure that all remaining problems are resolved and it will not break the CI. Refs: scylladb/scylladb#25621 Fixes: scylladb/scylladb#25721	2025-09-08 11:38:30 +02:00
Dawid Mędrek	bb0255b2fb	tools/scylla-sstable: Enable rf_rack_valid_keyspaces Enabling the configuration option should have no negative impact on how the tool behaves. There is no topology and we do not create any keyspaces (except for trivial ones using `SimpleStrategy` and RF=1), only their metadata. Thanks to that, we don't go through validation logic that could fail in presence of an RF-rack-invalid keyspace. On the other hand, enabling `rf_rack_valid_keyspaces` lets the tool access code hidden behind that option. While that might not be of any consequence right now, in the future it might be crucial (for instance, see: scylladb/scylladb#23030). Note that other tools don't need an adjustment: * scylla-types: it uses schema_builder, but it doesn't reuse any other relevant part of Scylla. * nodetool: it manages Scylla instances but is not an instance itself, and it does not reuse any codepaths. * local-file-key-generator: it has nothing to do with Scylla's logic. Other files in the `tools` directory are auxiliary and are instructed with an already created instance of `db::config`. Hence, no need to modify them either. Fixes scylladb/scylladb#25792 Closes scylladb/scylladb#25794	2025-09-08 11:52:43 +03:00
Yaron Kaikov	b07505a314	auto-backport.py: sync P0 and P1 labels when applied When triggering the backport process, adding a check for P0 and P1 labels, if available add them to backport PR together with force_on_cloud label Implementing first in pkg to test the process, then will move it to scylladb Fixes: PKG-62 Closes scylladb/scylladb#25856	2025-09-08 11:42:36 +03:00
Yaron Kaikov	407b7b0e18	Fix label parsing logic in backport check script Previously, the script attempted to assign GitHub Actions expressions directly within a Bash string using '${{ ... }}', which is invalid syntax in shell scripts. This caused the label JSON to be treated as a literal string instead of actual data, leading to parsing failures and incorrect backport readiness checks. This update ensures the label data is passed correctly via the LABELS_JSON environment variable, allowing jq to properly evaluate label names and conditions. Fixes: PKG-74 Closes scylladb/scylladb#25858	2025-09-08 11:42:16 +03:00
Dario Mirovic	20c173e958	cql: default `CREATE KEYSPACE` syntax Since all the options except `REPLICATION` already have defaults, and `REPLICATION` has defaults for all the fields inside, this patch makes `REPLICATION` optional. More specifically, there is no need for `WITH` in create keyspace statement anymore. This allows for the following syntax: `CREATE KEYSPACE [IF NOT EXISTS] <name>;` For example: `CREATE KEYSPACE test_keyspace;` Fixes #25145	2025-09-08 10:07:40 +02:00
Pawel Pery	61ee630f42	vector_store_client: add timeouts to tests Sometimes `vector_store_client_test_ann_request` test hangs up. It is hard to reproduce. It seems that the problem is that tests are unreliable in case of stalled requests. This patch attaches a timer to the abort_source to ensure that the test will finish with a timeout at least. Fixes: VECTOR-150 Fixes: #25234 Closes scylladb/scylladb#25301	2025-09-08 10:20:48 +03:00
Wojciech Mitros	10b8e1c51c	storage_proxy: send hints to pending replicas Consider the following scenario: - Current replica set is [A, B, C] - write succeeds on [A, B], and a hint is logged for node C - before the hint is replayed, D bootstraps and the token migrates from C to D - hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done - C is cleaned up, replayed data is lost, and D has a stale copy until next repair. In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets, as it can happen for every tablet migration. This issue is particularly detrimental to materialized views. View updates use hints by default and a specific view update may be sent to just one view replica (when a single base replica has a different row state due to reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due to a lost row update. Such inconsistencies can be fixed neither by repairing the view nor by repairing the base table. To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original target is still alive. This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't be too common either. The scenarios for them are: 1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will arrive on the pending replica anyway in streaming 2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to the actual source of the migration, the pending replica will get it during streaming 3. 
sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending replica for the hint so we'll send it multiple times 4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the pending replica, so we need to retry the entire write This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid sending the hint to pending replicas if the hint target is not the source of the migration, which would allow us to avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are in the same rack. We also add a test case reproducing the issue. Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes https://github.com/scylladb/scylladb/issues/19835 Closes scylladb/scylladb#25590	2025-09-08 09:18:20 +02:00
Pavel Emelyanov	9deea3655f	s3: Fix chunked download source metrics calculations In S3 client both read and write metrics have three counters -- number of requests made, number of bytes processed and request latency. In most of the cases all three counters are updated at once -- upon response arrival. However, in case of chunked download source this way of accounting metrics is misleading. In this code the request is made once, and then the obtained bytes are consumed eventually as the data arrive. Currently, each time a new portion of data is read from the socket the number of read requests is incremented. That's wrong, the request is made once, and this counter should also be incremented once, not for every data buffer that arrived in response. Same for read request latency -- it's "added" for every data buffer that arrives, but it's a lengthy process, the _request_ latency should be accounted once per response. Maybe later we'll want to have "data latency" metrics as well, but for what we have now it's request latency. The number of read bytes is accounted properly, so not touched here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25770	2025-09-08 09:49:03 +03:00
Avi Kivity	03ee862b50	cql3: statement_restrictions: forbid querying a single-column or token restriction on a multi-column restriction In `41880bc893` ("cql3: statement_restrictions: forbid querying a single-column inequality restriction on a multi-column restriction"), we removed the ability to constrain a single column on a tuple inequality, on the grounds that it isn't used and can't be used. Here, we extend this to remove the ability to constrain a single column on a tuple equality, on the grounds that it isn't used and hampers further refactoring. CQL supports multi-column equality restrictions in the form (ck1, ck2, ck3) = (:v1, :v2, :v3) This restriction shape is only allowed on clustering keys, and is translated into a partition_slice allowing the primary index to efficiently select the part of the partition that satisfies the restriction. The possible_lhs_values() function allows extracting single-column restrictions from this and similar tuple restrictions. For example, the multi-column restriction (ck1, ck2, ck3) = (:v1, :v2, :v3) implies that ck2 = :v2. If we have an index on ck2, and if we don't further have a restriction on the partition key, then it is advantageous to use the index to select rows, and then filter on ck1 and ck3 to satisfy the full restriction. However, we never actually do that. The following sequence ```cql CREATE TABLE ks.t1 ( pk int, ck1 int, ck2 int, PRIMARY KEY (pk, ck1, ck2) ); CREATE INDEX ON ks.t1(ck1); SELECT * FROM ks.t1 WHERE (ck1, ck2) = (1, 2); ``` Could have been used to query a single partition via the index, but instead is used for a full table scan, using the partition slice to skip through unselected rows. We can't easily start using a new query plan via the index, since switching plans mid-query (due to paging and moving from one coordinator to another during upgrade) would cause the sort order to change, therefore causing some rows to be omitted and some rows to be returned twice. 
Similarly, we cannot extract a token restriction from a tuple, since the grammar doesn't allow for ```cql WHERE (token(pk)) = (:var1) ``` Since it's not used, remove it. This code was first introduced in `d33053b841` ("cql3/restrictions: Add free functions over new classes") It does not directly correspond to pre-expression code. Closes scylladb/scylladb#25757 Closes scylladb/scylladb#25821	2025-09-07 18:36:05 +03:00
Nadav Har'El	040d6e2245	Merge 'interval: specialize for trivially copyable types' from Avi Kivity Interval's copy and move constructors are full of branches since the two payload T:s are optional and therefore have to be optionally-constructed. This can be eliminated for trivially copyable types (like dht::token) by eliminating interval's user-defined special member functions (constructors etc) in that special case. In turn, this enables optimizations in the standard library (and our own containers) that convert moves/copies of spans of such types into memcpy(). Minor optimization, not a candidate to backport. Closes scylladb/scylladb#25841 * github.com:scylladb/scylladb: test: nonwrapping_interval_test: verify an interval of tokens is trivial interval: specialize interval_data<T> for trivial types interval: split data members into new interval_data class	2025-09-07 17:10:32 +03:00
Raphael S. Carvalho	68f23d54d8	replica: Fix split compaction when tablet boundaries change Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:20:23 -03:00
Raphael S. Carvalho	0c1587473c	replica: Futurize split_compaction_options() Preparation for the fix of #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:19:09 -03:00
Michał Chojnowski	77dcb2bcda	test/lib/random_schema: add some constraints for generated uuid and time/date values I want to write a test which generates a random table (random schema, random data) and uses the Python driver to query it. But it turns out that some values generated by test/lib/random_schema can't be deserialized by the Python driver. For example, it doesn't support unknown uuid versions, dates before year 1 or after year 9999, or `time` values greater or equal to the number of nanoseconds in a day. AFAIK those "driver-illegal" values aren't particularly interesting for tests which use `random_schema`, so we can just not generate them.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	3ce7b761ce	test/lib/random_utils: add a variant of get_bytes which takes an `engine&` I will need it in a test later.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	927c7ecbb0	test/boost: add bti_index_test Adds a fat unit test (integration test?) for BTI index writers and readers. The test generates a small random dataset for the index writer, writes it to a BTI file, and then attempts to run all possible (and legal) sequences (up to a certain length) of index reader operations and check that the results match the expectation (dictated by a "simple" reference index reader). It is currently the only test for this new feature, but I think it's reasonably good. (I validated the coverage by looking at LLVM coverage reports and by manually adding bugs in various places and checking that the test catches it after a reasonably small number of runs). (Note that in a later series, when we hook up BTI indexes to Data.db readers/writers, the feature will also be indirectly tested by the corresponding integration tests). This is a randomized test. As such, its power grows with the number of runs. In particular, a single run has a decently high probability of not exercising parts of the code at all. (E.g. the generated dataset might have no clustering keys). Also this is a slow test. (With the default parameters, ~1s in release mode on my PC, several seconds in debug mode. (And note that this is after a custom parameter downsizing for debug mode, because this test is slowed extremely badly by debug mode, due to the forced preemption after every future)). For those two reasons, I'm not glad to include it in the test suite that runs in CI. Instead of running it once in every CI run, I'd very much rather have it run 10000 times after the tested feature is touched, and before releases. Unfortunately we don't have a precedent for something like that.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	94088f0a41	sstables/writer: add an accessor for the current write position in Data.db It will be used by index tests to know the ground truth for where each partition and row are written, so that this can be checked against the index.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	11a5eec272	sstables/trie: introduce bti_index_reader Implements an index reader (implementing abstract_index_reader, which is the interface between Data.db readers and index readers) for the BTI index written by bti_partition_index_writer and bti_row_index_writer. The underlying trie walking logic is a thin wrapper around the logic added in `trie_traversal.hh` in an earlier patch series.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	0cbf6d8d24	sstables/trie: add bti_partition_index_writer.cc Implements the Partition.db writer, which maps the (BTI-encoded) decorated keys to the position of the partition header in either Rows.db (iff the partition possesses an intra-partition index at all) or Data.db. The format of Partition.db is supposed to be as described in: `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md (partition-index)` I didn't actually test it for compatibility with Cassandra (yet?). (I followed the docs linked above and Cassandra's source code, but it could be that I have made a mistake somewhere).	2025-09-07 00:32:02 +02:00
Michał Chojnowski	7e8fd13208	sstables/trie: add bti_row_index_writer.cc Implements the Rows.db writer, which for each partition that possesses an intra-partition index writes a trie of separators between clustering key blocks, and a header (footer?) with enough metadata (partition key, partition tombstone) to allow for a direct jump to a clustering row in Data.db. The format of Rows.db is supposed to be as described in: `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md (row-index)` and the arbitrary details about the choice of clustering block separators follows Cassandra's choices. I didn't actually test it for compatibility with Cassandra (yet?). (I followed the docs linked above and Cassandra's source code, but it could be that I have made a mistake somewhere).	2025-09-07 00:32:02 +02:00
Michał Chojnowski	a800fef633	utils/bit_cast: add a new overload of write_unaligned() Does the same thing as the existing overload, but this one takes `std::byte` instead of `void`, and it additionally returns the pointer to the end position.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	9a4b739b8d	sstables/trie: add trie_writer::add_partial() `trie_writer` has an `add()` method which can add a new branch (key) with a payload at the end. In a later patch, we will want a way to split this addition into several steps, just a way to more naturally deal with fragmented keys. So this commit adds a method which adds new nodes but doesn't attach a payload at the end. This allows for a situation where leaf nodes are added without a payload, which is not supposed to happen. It's the user's responsibility to avoid that. Note: this might be overengineered. Maybe we should force the user to linearize the key. Maybe caring so much about fragmented keys as first-class citizens is the wrong thing to do.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	62843ceac9	sstables/consumer: add read_56() In a later commit, we want to use sstables/consumer.hh to implement a parser of BTI row index headers read from Rows.db. Partition tombstones in this file have an encoding which uses the first bit of the first byte to determine if the tombstone is live or not. If yes, then the timestamp is in the remaining 7 bits of the first byte, and the next 7 bytes, and the deletion_time is in the 4 bytes after that. So I need some way to read 1 byte, and then, depending on its value, maybe read the next 7 bytes and then 4 bytes. This commit adds a helper for reading a 7-byte int. Now that I'm typing this out, maybe that's not the smartest idea. Maybe I should just "manually" read the 11 bytes as 8, 2, 1. But I've already written this, so I might as well post it, it can always be replaced later.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	c6ad5511db	sstables/trie: make bti_node_reader::page_ptr copy-constructible In later commits, a trie cursor will be holding a `page_ptr`. Sometimes we will want to copy a cursor, in particular to reset the upper bound of the index reader with `_lower = _upper`. But currently `page_ptr` is non-copyable -- it's a shared pointer, but with an explicit `share()` method -- so a default operator= won't work for this. Let's make `page_ptr` copy-assignable for this purpose.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	25e98f1ff7	sstables: extract abstract_index_reader from index_reader.hh to its own header In later parts of the series, we add a second implementation of `abstract_index_reader`. To do that, we want a header with `abstract_index_reader`, but we don't need to pull in everything else from `index_reader.hh`. So let's extract `abstract_index_reader` to its own header.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	475cb18e90	sstables/trie: add an accessor to the file_writer under bti_node_sink The row index writer will need both a trie_writer and direct access to its underlying file_writer (to write partition headers, which are "outside" of the trie). And it would be weird to keep a separate reference to the file_writer if the trie_writer already does that. So let's add accessors needed to get to the file_writer&.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	d2fd0f9592	sstables/types: make `deletion_time::operator tombstone()` const No reason not to make it const.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	1fdac048d3	sstables/types: add sstables::deletion_time::make_live() A small helper, will be useful in later commits.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	0684dbb5bd	sstables/trie: fix a special case in max_offset_from_child A `writer_node` contains a chain of multiple BTI nodes. `writer_node::_node_size` is the size (in bytes) of the entire chain. But the parent of the `writer_node` wants to know the offset to the rootmost node in the chain. This can be deduced from the `writer_node::_transition_length` and the `writer_node::_node_size`. But there's an error in the logic for that, for the special case when there are two nodes in the chain. The rootmost node will be SINGLE_NOPAYLOAD_12 if and only if the leafmost node is smaller than 16 bytes, which is true if and only if `_node_size` is smaller than 19 bytes. But the current logic compares `_node_size` against 16. That's incorrect. This patch fixes that. There was a test for this branch, but it was not good enough. It only tested payloads of size 1 and 20, but the problem is only revealed by payloads of size 13-14. The test has been extended to cover all sizes between 1 and 20.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	b2d793552f	sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding As far as I know, positions outside the clustering region are never passed to sstable indexes. But since they are representable within the argument type of lazy_comparable_bytes_from_clustering_position, let's make it handle them.	2025-09-07 00:30:08 +02:00
Avi Kivity	49b0751980	test: nonwrapping_interval_test: verify an interval of tokens is trivial Since dht::token is trivial, an interval<dht::token> ought to be trivial too. Verify that.	2025-09-06 18:41:00 +03:00
Avi Kivity	ed483647a4	interval: specialize interval_data<T> for trivial types C++ data movement algorithms (std::uninitialized_copy() and friends) and the containers that use them optimize for trivially copyable and destructible types by calling memcpy instead of using a loop around constructors/destructors. Make intervals of trivially copyable and destructible types also trivially copyable and destructible by specializing interval_data<T> not to have user-defined special member functions. This requires that T have a default constructor since we can't skip construction when !_start_exists or !_end_exists. To choose whether we specialize or not, we look at default constructibility (see above) and trivial destructibility. This is wider than trivial copyability (a user-defined copy constructor can exist) but is still beneficial, since the generated copy constructor for interval_data<T> will be branch-free. We don't implement the poison words in debug mode; nor are they necessary, since we no longer manage the lifetime of _start_value and _end_value manually but let the compiler do that for us. Note [1] prevents full conversion to memcpy for now, but we still get branch-free code. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121789	2025-09-06 18:38:24 +03:00
Avi Kivity	20751517a4	interval: split data members into new interval_data class Prepare for specialized handling of trivial types by extracting the data members of wrapping_interval<T> and the special member functions (constructors/destructors/assignment) into a new interval_data<T> template. To avoid having to refer to data members with a this-> prefix, add using declarations in wrapping_interval<T>.	2025-09-06 18:31:58 +03:00
Pavel Emelyanov	b26816f80d	s3: Export memory usage gauge (metrics) The memory usage is tracked with the help of a semaphore, so just export its "consumed" units. One tricky place here is the need to skip metrics registration for the scylla-sstable tool. The thing is that the tool starts the storage manager and sstables manager on start, and then some of the tool's operations may want to start both managers again (via cql environment), causing a double metrics registration exception. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25769	2025-09-05 18:25:34 +03:00
Botond Dénes	a96d31e684	Merge 'Update workflow trigger to pull_request_target - fixing fork PR bug' from Dani Tweig The previous version had a problem: Fork PRs didn't pass the Jira credentials to the main code, which updates the Jira key status. No need for backport. This is not the Scylla code, but a fix to GitHub Actions. Closes scylladb/scylladb#25833 * github.com:scylladb/scylladb: Change pull_request event to pull_request_target - ready for merge Update workflow to use pull_request_target event - in review Change pull_request event to pull_request_target - in progress	2025-09-05 18:23:19 +03:00
Anna Stuchlik	f66580a28f	doc: add support for i7i instances This commit adds currently supported i7i and i7ie instances to the list of instance recommendations. Fixes https://github.com/scylladb/scylladb/issues/25808 Closes scylladb/scylladb#25817	2025-09-05 14:14:58 +02:00
Andrei Chekun	da4990e338	test.py: add additional level of verbosity for output Add another level of verbosity: quiet. Before this it was used as the default one, but it does not provide enough information. These changes should be coupled with the pytest-sugar plugin to get the intended information for each level.	2025-09-05 11:54:49 +02:00
Andrei Chekun	7e34d5aa28	test.py: start pytest as a module instead of subprocess Invoke pytest as a module, instead of a separate process, to get access to the terminal and be able to use it interactively.	2025-09-05 11:54:49 +02:00
Pavel Emelyanov	dc44fca67c	Revert "test/gossiper: add reproducible test for race condition during node decommission" This reverts commit `5dac4b38fb` as per request from #25803	2025-09-05 09:56:46 +03:00
Pavel Emelyanov	46aa59fe49	Revert "gossiper: check for a race condition in `do_apply_state_locally`" This reverts commit `13392a40d4` as per request from #25803	2025-09-05 09:56:21 +03:00
Anna Mikhlin	21ee24f7cd	trigger-scylla-ci: ignore comment from scylladbbot ignore comments posted by scylladbbot, to allow adding instruction in CI completion report of how to re-trigger CI Closes scylladb/scylladb#25838	2025-09-05 06:18:51 +03:00
dependabot[bot]	862f965196	build(deps): bump sphinx-scylladb-theme from 1.8.7 to 1.8.8 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.7 to 1.8.8. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.7...1.8.8) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#25823	2025-09-04 18:24:09 +03:00
Nadav Har'El	a1ed2c9d4b	Merge 'Allow users to SELECT from CDC log tables they created.' from Dawid Pawlik Before the patch, a user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not run a SELECT on the CDC table they created. This was because the SELECT permission was checked on the CDC log, and later on its "parent" - the keyspace, but not on the base table, on which the user had SELECT permission automatically granted on CREATE. This patch matches the behavior of querying the CDC log to the one implemented for Materialized Views: 1. No new permissions are granted on CREATE. 2. When querying SELECT, the permissions on base table SELECT are checked. Fixes: https://github.com/scylladb/scylladb/issues/19798 Fixes: VECTOR-151 Closes scylladb/scylladb#25797 * github.com:scylladb/scylladb: cqlpy/test_permissions: run the reproducer tests for #19798 select_statement: check for access to CDC base table	2025-09-04 16:56:52 +03:00
Piotr Wieczorek	db8b7d1c89	alternator: Store LSI keys in :attrs for newly created tables Before this patch, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, in the LSI's materialized view, we create a computed column for each LSI's range key that is not a key in the base table. This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all fields in a row, except for keys, in new tables. Refs https://github.com/scylladb/scylladb/pull/24991 Refs https://github.com/scylladb/scylladb/issues/6930	2025-09-04 15:02:37 +02:00
Michał Chojnowski	b3c20e8cc3	sstables/trie: rewrite lcb_mismatch to handle fragment invalidation When cleaning up `lcb_mismatch` for review, I managed to forget that in the follow-up series I want to use it with iterators for which `it` points to data which is invalidated by `++it`. (The data in `lazy_comparable_bytes_` generators is kept in a vector of `bytes`, so `it` can point to the internal storage of `bytes`. Generating a new fragment via `++it` can resize the vector, move the `bytes`, and invalidate the `it`.) So during the pre-review cleanup, `lcb_mismatch` ended up in a shape which isn't prepared for that. This commit shuffles the control flow around so that `++it` is delayed until after the span obtained with `*it` is exhausted.	2025-09-04 15:02:29 +02:00
Michał Chojnowski	3c3ed867e6	test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr` The branch inside the `if constexpr (debug)` contains a piece of template code that doesn't typecheck properly. (I used this code before committing it, but apparently I let it become outdated when some changes around it happened). Fix that.	2025-09-04 15:02:29 +02:00
Piotr Wieczorek	089591a3db	alternator/test: Add LSI tests based mostly on the existing GSI tests This patch adds LSI tests that correspond to already existing tests for GSI, filling up some gaps in LSI: - `_wrong_type_attribute_`, - `test_lsi_duplicate_with_different_name`, - `test_lsi_empty_value_in_batch_write`, - `test_lsi_null_index`. As well as some new ones: - `test_lsi_empty_value_binary`, - `test_lsi_put_overwrites_lsi_column`, - `test_lsi_*_modifies_index`.	2025-09-04 15:02:22 +02:00
Dani Tweig	ddac32b656	Change pull_request event to pull_request_target - ready for merge Fix the fork PRs bug	2025-09-04 12:47:25 +03:00
Dani Tweig	eb0bb0f3a0	Update workflow to use pull_request_target event - in review Fix a fork PRs bug.	2025-09-04 12:42:52 +03:00
Dani Tweig	4c460464b8	Change pull_request event to pull_request_target - in progress Fix fork PRs bug.	2025-09-04 12:41:29 +03:00
Botond Dénes	db72430d82	Merge 'Don't leave pre-scrub snapshot on API error' from Pavel Emelyanov The pre-scrub snapshot is taken in the middle of parsing options from the request. In case the post-snapshot part of the parsing throws (it can do so if the "quarantine_mode" value is not recognized), the snapshot remains on disk, but the API call fails. The fix is to move snapshot taking out of the parse_scrub_options() helper. It could be moved to the end of it, but the helper name doesn't tell that it also takes a snapshot, so no. After the fix the helper in question can be simplified further. The issue exists in older versions, but likely doesn't reveal itself for real, so it doesn't look worthwhile to backport it. Closes scylladb/scylladb#25824 * github.com:scylladb/scylladb: api: Simplify parse_scrub_options() helper api: Take snapshot after parsing scrub options	2025-09-04 12:13:16 +03:00
Avi Kivity	169092b340	Merge 'pgo: add auth connections stress workload' from Marcin Maliszkiewicz This series improves the pgo workloads by enabling authentication and authorization and adding new stress scenarios. - Enables auth in training clusters All training workloads now run with auth enabled, following best practices and avoiding config proliferation. - Adds auth connections stress workload Introduces a workload that uses derived roles and permissions, stressing auth code paths while also creating a new connection per request to exercise server transport handling. - Enables counters workload The counters workload is re-enabled without introducing extra dependencies on cqlsh. Instead, a lightweight exec_cql.py wrapper (shared with the auth workload) handles preparation statements. Backport: no, it's not a bug fix ---------------------------------------------------------- Performance results for auth PGO: there seems to be no difference, or it is too small to measure: scylladb pgo_auth ≡ ◦ ⤖ python3 ./pgo/auth_conns_stress.py localhost cassandra cassandra 10000 100 & scylladb pgo_auth ≡ ◦ ⤖ perf stat -e instructions --timeout 5000 -p 51591 on both before and after, the instructions counter varies from 179,818,558,011 to 180,664,528,198. 
---------------------------------------------------------- Performance results for counters PGO is notably improved with write workload 16-22% and read 4-5%: scylladb pgo_auth ≡ ◦ ⤖ ./scylla_master perf-simple-query 2> /dev/null --counters --write random-seed=3839439576 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=yes} Disabling auto compaction 2413435.37 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 51167 insns/op, 33157 cycles/op, 0 errors) 2413009.40 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 51053 insns/op, 33009 cycles/op, 0 errors) 2403794.31 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 50867 insns/op, 32899 cycles/op, 0 errors) 2384572.52 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 50562 insns/op, 32811 cycles/op, 0 errors) 2195388.31 tps (122.6 allocs/op, 8.0 logallocs/op, 14.1 tasks/op, 51818 insns/op, 34504 cycles/op, 0 errors) throughput: mean= 2362039.98 standard-deviation=93892.61 median= 2403794.31 median-absolute-deviation=50969.42 maximum=2413435.37 minimum=2195388.31 instructions_per_op: mean= 51093.44 standard-deviation=465.18 median= 51052.98 median-absolute-deviation=226.61 maximum=51818.04 minimum=50562.30 cpu_cycles_per_op: mean= 33275.85 standard-deviation=698.65 median= 33008.58 median-absolute-deviation=377.16 maximum=34504.13 minimum=32811.18 scylladb pgo_auth ≡ ◦ ⤖ ./scylla_master perf-simple-query 2> /dev/null --counters random-seed=1134551638 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=yes} Disabling auto compaction Creating 10000 partitions... 
5499534.56 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21463 insns/op, 14902 cycles/op, 0 errors) 5478913.87 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21385 insns/op, 14839 cycles/op, 0 errors) 5346525.04 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.1 tasks/op, 21454 insns/op, 15082 cycles/op, 0 errors) 5467947.74 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21275 insns/op, 14775 cycles/op, 0 errors) 5454894.98 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21250 insns/op, 14766 cycles/op, 0 errors) throughput: mean= 5449563.24 standard-deviation=59878.80 median= 5467947.74 median-absolute-deviation=29350.63 maximum=5499534.56 minimum=5346525.04 instructions_per_op: mean= 21365.28 standard-deviation=98.95 median= 21384.65 median-absolute-deviation=90.57 maximum=21463.17 minimum=21250.33 cpu_cycles_per_op: mean= 14872.93 standard-deviation=129.26 median= 14838.65 median-absolute-deviation=97.52 maximum=15082.44 minimum=14766.13 scylladb pgo_auth ≡ ◦ ⤖ ./scylla_pgo_counters perf-simple-query 2> /dev/null --counters --write random-seed=437758611 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=yes} Disabling auto compaction 2950968.10 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41540 insns/op, 27097 cycles/op, 0 errors) 2923325.10 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41366 insns/op, 27017 cycles/op, 0 errors) 2928666.67 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41274 insns/op, 26929 cycles/op, 0 errors) 2918378.39 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41165 insns/op, 26880 cycles/op, 0 errors) 2209053.17 tps (128.4 allocs/op, 8.0 logallocs/op, 14.6 tasks/op, 48176 insns/op, 34726 cycles/op, 0 errors) throughput: mean= 2786078.28 standard-deviation=322807.25 median= 2923325.10 median-absolute-deviation=142588.38 maximum=2950968.10 minimum=2209053.17 instructions_per_op: mean= 42704.41 
standard-deviation=3062.05 median= 41366.40 median-absolute-deviation=1430.69 maximum=48176.45 minimum=41165.23 cpu_cycles_per_op: mean= 28529.92 standard-deviation=3464.99 median= 27016.51 median-absolute-deviation=1601.02 maximum=34726.49 minimum=26880.18 scylladb pgo_auth ≡ ◦ ⤖ ./scylla_pgo_counters 2> /dev/null --counters random-seed=4277130772 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=yes} Disabling auto compaction Creating 10000 partitions... 5691320.62 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20656 insns/op, 14279 cycles/op, 0 errors) 5708878.25 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20486 insns/op, 14104 cycles/op, 0 errors) 5727060.22 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20439 insns/op, 14044 cycles/op, 0 errors) 5700157.92 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20416 insns/op, 14054 cycles/op, 0 errors) 5610730.84 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20459 insns/op, 14195 cycles/op, 0 errors) throughput: mean= 5687629.57 standard-deviation=44972.99 median= 5700157.92 median-absolute-deviation=21248.68 maximum=5727060.22 minimum=5610730.84 instructions_per_op: mean= 20491.27 standard-deviation=95.74 median= 20459.35 median-absolute-deviation=52.13 maximum=20656.27 minimum=20415.95 cpu_cycles_per_op: mean= 14135.25 standard-deviation=100.27 median= 14104.47 median-absolute-deviation=81.48 maximum=14278.97 minimum=14043.76 Closes scylladb/scylladb#25651 * github.com:scylladb/scylladb: pgo: add links to issues about tablet missing features pgo: enable counters workload pgo: add auth connections stress workload pgo: enable auth in training clusters	2025-09-04 11:46:39 +03:00
Pavel Emelyanov	b86b4fc251	api: Simplify parse_scrub_options() helper It no longer needs to be a coroutine, nor does it need the snapshot_ctl reference argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-03 19:06:31 +03:00
Pavel Emelyanov	ee4197fa80	api: Take snapshot after parsing scrub options Parsing scrub options may throw after a snapshot is taken, thus leaving it on disk even though the operation is reported as "failed". Not, probably, critical, but not nice either. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-03 19:05:50 +03:00
Marcin Maliszkiewicz	2109110037	pgo: add links to issues about tablet missing features	2025-09-03 15:43:52 +02:00
Marcin Maliszkiewicz	8aa2825caa	pgo: enable counters workload It was not enabled due to some cqlsh dependency missing. After 3 years it's hard to say if the thing is fixed or not, but anyway we don't need another big dependency while we already have the python driver used extensively in tests. We use a simple wrapper file, exec_cql.py, shared with the auth_conns workload, to conveniently read the needed preparation statements from the file. Additionally we switch tablets off, as counters don't support them yet.	2025-09-03 15:43:51 +02:00
Marcin Maliszkiewicz	09476a4df8	pgo: add auth connections stress workload It uses some derived roles and permissions to exercise auth code paths and also creates new connection with each stress request to exercise also transport/server.cc connection handling code.	2025-09-03 15:43:51 +02:00
Marcin Maliszkiewicz	f2270034ec	pgo: enable auth in training clusters As it's best practice to use auth and we don't want to have 2^n configs to train we just enable auth for every workload.	2025-09-03 15:29:27 +02:00
Dawid Mędrek	d2c5268196	cql3: Produce CREATE MATERIALIZED VIEW statement when describing MV of index Before this change, executing `DESCRIBE MATERIALIZED VIEW` on the underlying materialized view of a secondary index would produce a `CREATE INDEX` statement. It was not only confusing, but it also prevented users from learning about the definition of the view. The only way to do so was to query system tables. We change that behavior and produce a `CREATE MATERIALIZED VIEW` statement instead. The statement is printed as a comment to implicitly convey that the user should not attempt to execute it to restore the view. A short comment is provided to make it clearer. Before this commit: ``` cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int); cqlsh> CREATE INDEX i ON ks.t(v); cqlsh> DESCRIBE MATERIALIZED VIEW ks.i; CREATE INDEX i ON ks.t(v); ``` After this commit: ``` cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int); cqlsh> CREATE INDEX i ON ks.t(v); cqlsh> DESCRIBE MATERIALIZED VIEW ks.i; /* Do NOT execute this statement! It's only for informational purposes. This materialized view is the underlying materialized view of a secondary index. It can be restored via restoring the index. 
CREATE MATERIALIZED VIEW ks.i_index [...]; */ ``` Note that describing the base table has not been affected and still works as follows: ``` cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int); cqlsh> CREATE INDEX i ON ks.t(v); cqlsh> DESCRIBE TABLE ks.t; CREATE TABLE ks.t ( p int, v int, PRIMARY KEY (p) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'IncrementalCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE' AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'}; CREATE INDEX i ON ks.t(v); ``` We also provide two reproducers of scylladb/scylladb#24610. Fixes scylladb/scylladb#24610 Closes scylladb/scylladb#25697	2025-09-03 15:21:37 +02:00
Piotr Dulikowski	f95808cbe7	Merge 'cdc/generation: Clone `topology_description` asynchronously' from Dawid Mędrek An instance of `cdc::topology_description` can be quite big. The vector it consists of stores as many `token_range_description`s as there are vnodes, and the size of each `token_range_description` is O(#shards). Because of that, copying an instance of the type can lead to reactor stalls. To prevent that, we introduce an asynchronous function copying the contents of the object. Reactor stalls were detected in the call to `map_reduce` in `generation_service::legacy_do_handle_cdc_generation`, so let's start using the new function there. A similar scenario occurs in `generation_service::handle_cdc_generation`, so we modify it too. Unfortunately, it doesn't seem viable to provide a reproducer of said problem. Fixes scylladb/scylladb#24522 Backport: none. Reactor stalls are not critical. Closes scylladb/scylladb#25730 * github.com:scylladb/scylladb: cdc/generation: Delete copy constructors of topology_description cdc/generation: Clone topology_description asynchronously	2025-09-03 13:41:58 +02:00
Dawid Pawlik	5e72d71188	cqlpy/test_permissions: run the reproducer tests for #19798 Since the previous commit fixes the issue, we can remove the xfail mark. The tests should pass now.	2025-09-03 13:20:39 +02:00
Dawid Pawlik	be54346846	select_statement: check for access to CDC base table Before the patch, a user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not run a SELECT on the CDC table they created. This was because the SELECT permission was checked on the CDC log, and later on its "parent" - the keyspace, but not on the base table, on which the user had SELECT permission automatically granted on CREATE. This patch matches the behaviour of querying the CDC log to the one implemented for Materialized Views: 1. No new permissions are granted on CREATE. 2. When querying SELECT, the permissions on base table SELECT are checked. Fixes: #19798	2025-09-03 13:20:39 +02:00
Botond Dénes	6116f9e11b	Merge 'Compaction tasks progress' from Aleksandra Martyniuk Determine the progress of compaction tasks that have children. The progress of a compaction task is calculated using the default get_progress method. If the expected_total_workload method is implemented, the default progress is computed as: (sum of child task progresses) / (expected total workload) If expected_total_workload is not defined, progress is estimated based on children progresses. However, in this case, the total progress may increase over time as the task executes. All compaction tasks, except for reshape tasks, implement the expected_children_number method. To compute expected_total_workload, iterate over all SSTables covered by the task and sum their sizes. Note that expected_total_workload is just an approximation and the real workload may differ if the set of SSTables for the keyspace/table/compaction group changes. Reshape tasks are an exception, as their scope is determined during execution. Hence, for these tasks expected_total_workload isn't defined and their progress (both total and completed) is determined based on currently created children. Fixes: https://github.com/scylladb/scylladb/issues/8392. Fixes: https://github.com/scylladb/scylladb/issues/6406. Fixes: https://github.com/scylladb/scylladb/issues/7845. 
New feature, no backport needed Closes scylladb/scylladb#15158 * github.com:scylladb/scylladb: test: add compaction task progress test compaction: set progress unit for compaction tasks compaction: find expected workload for reshard tasks compaction: find expected workload for global cleanup compaction tasks compaction: find expected workload for global major compaction tasks compaction: find expected workload for keyspace compaction tasks compaction: find expected workload for shard compaction tasks compaction: find expected workload for table compaction tasks compaction: return empty progress when compaction_size isn't set compaction: update compaction_data::compaction_size at once tasks: do not check expected workload for done task	2025-09-03 13:23:42 +03:00
Pavel Emelyanov	b0aa2d61d9	Merge 'cql3: add default replication factor to `create_keyspace_statement`' from Dario Mirovic When creating a new keyspace, replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };` This patch changes it in the following way - if there is no replication factor specified, use default replication factor. Default replication factor is equal to the number of racks that are not arbiter-only, i.e. racks that have at least one non-arbiter node. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy' };` `CREATE KEYSPACE ks WITH REPLICATION { };` Fixes #16028 Backport is not needed. This is an enhancement for future releases. Closes scylladb/scylladb#25570 * github.com:scylladb/scylladb: docs/cql: update documentation for default replication factor test/cqlpy: add keyspace creation default replication factor tests cql3: add default replication factor to `create_keyspace_statement`	2025-09-03 12:31:53 +03:00
Pavel Emelyanov	c0808c90b0	api: Use validate_table() helper in /storage_service/tokens_endpoint handler The handler validates if the given ks:cf pair exists in the database, then finds the table id to process further. There's a helper that does both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25669	2025-09-03 11:44:50 +03:00
Pavel Emelyanov	b5610050a1	api: Make GET/storage_service/drain handler work on storage service POSTing on the same URL launches storage_service::drain(), so GETing on it should (not that it's restricted somehow, but still) work on the same service. This change removes one more user of http_context::database, which in turn will allow removing the database reference from the context eventually. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25677	2025-09-03 11:40:39 +03:00
Radosław Cybulski	7b3d42f83e	Remove unused boost macro definitions Closes scylladb/scylladb#25742	2025-09-03 10:06:33 +03:00
Radosław Cybulski	c242234552	Revert "build: add precompiled headers to CMakeLists.txt" This reverts commit `01bb7b629a`. Closes scylladb/scylladb#25735	2025-09-03 09:46:00 +03:00
Calle Wilund	bc20861afb	system_keyspace: Prune dropped tables from truncation on start/drop Fixes #25683 Once a table drop is complete, there should be no reason to retain truncation records for it, as any replay should skip mutations anyway (no CF), and iff we somehow resurrect a dropped table, this replay-resurrected data is the least problem anyway. Adds a prune phase to the startup drop_truncation_rp_records run, which ignores updating, and instead deletes records for nonexistent tables (which should patch any existing servers with lingering data as well). Also does an explicit delete of records on actual table DROP, to ensure we don't grow this table more than needed even in long uptime nodes. Small unit test included. Closes scylladb/scylladb#25699	2025-09-03 07:25:34 +03:00
Sergey Zolotukhin	13392a40d4	gossiper: check for a race condition in `do_apply_state_locally` In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. Fixes scylladb/scylladb#25702 Fixes scylladb/scylladb#25621 Ref scylladb/scylla-enterprise#5613 Closes scylladb/scylladb#25727	2025-09-02 20:44:21 +02:00
Piotr Dulikowski	78ef334333	Merge 'Move "cache" API endpoints registration closer to column_family ones ' from Pavel Emelyanov These two "blocks" of endpoints have different URL prefixes, but work with the same "service", which is sharded<replica::database>. The latter block had already been fixed to carry the sharded<database>& around (#25467), now it's the "cache" turn. However, since these endpoints also work with the database, there's no need in dedicated top-level set/unset machinery (similarly, gossiper has two API set/unset blocks that come together, see #19425), it's enough to just set/unset them next to each other. Ongoing http_context dependency cleanup, no need to backport Closes scylladb/scylladb#25674 * github.com:scylladb/scylladb: api: Capture and use db in cache_service handlers api: Add sharded<database>& arg to set_cache_service() api: Squash (un)set_cache_service into ..._column_family api: Coroutinize set_server_column_family()	2025-09-02 13:59:02 +02:00
Avi Kivity	7ed261fc52	Merge 'Inital GCP object storage support' from Calle Wilund Adds infrastructure and client for interaction with GCP object storage services. Note: this is just a client object usable for creating, listing, deleting and up/downloading of objects to/from said storage service. It makes no attempt at actually inserting it into the sstable storage flow. That can come later. This PR breaks out GCP auth and some general REST call functionality into shared routines. Not all code is 100% reused, but at least some. Test is added, though could be more comprehensive (feel free to suggest a test vector). Test can run in either local mock server mode (default), or against actual GCP. See `test/boost/gcp_object_storage_test.cc` for explanation on the config environment vars. Default is to run the test against a temporary docker daemon. Closes scylladb/scylladb#24629 * github.com:scylladb/scylladb: test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage proc-utils: Re-export waiting types from seastar proc-utils: Inherit environment from current process utils::gcp::object_storage: Add client for GCP object storage utils::http: Add optional external credentials to dns_connection_factory init utils::rest: Break out request wrapper and send logic encryption::gcp_host: Use shared gcp credentials + REST helpers utils::gcp: Move/add gcp credentials management to shared file utils::rest::client: Add formatter for seastar::http::reply utils::rest::client: Add helper routines for simple REST calls utils::http: Make shared system trust certificates public	2025-09-02 14:38:09 +03:00
Avi Kivity	fe308de8df	Merge 'treewide: Add missing `#pragma once`' from Ernest Zaslavsky Add missing #pragma once and license boilerplate to include headers. Consider adding a CI step to catch missing header guards early. It can be done easily by running `cpplint` like below ``` find . -path ./seastar -prune -o -path ./venv -prune -o -path ./idl -prune -o -type f $ -name ".h" -o -name ".hh" -o -name ".hpp" $ -print0 \| xargs -0 cpplint 2>&1 \| grep "header guard found" ``` No backport is needed, the change is not "functional" Closes scylladb/scylladb#25768 github.com:scylladb/scylladb: treewide: Add missing license boilerplate treewide: Add missing `#pragma once`	2025-09-02 13:18:04 +03:00
Piotr Dulikowski	762d9ef68f	Merge 'cdc: Set tombstone_gc when creating log table' from Dawid Mędrek Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately, that didn't happen when creating CDC log tables, and so we might have missed some of the properties that would normally be set to some value, even if the default one. One particular example of that phenomenon was `tombstone_gc`. For better or worse, it's not a "standalone property" of a table, but rather part of `extensions`. [Somewhat related issue: scylladb/scylladb#9722] That may have and did cause trouble. Consider this scenario: 1. A CDC log table is created. 2. The table does NOT have any value of `tombstone_gc` set. 3. The user edits the table via `ALTER TABLE`. That statement treats the log table just like any other one (at least as far as the relevant portion of the logic is concerned). Among other things, it uses `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc` property is set to some value: * the default one if the user doesn't specify it in the statement, * a custom one if they do. Why is that a problem? First of all, it's confusing. When we perform a schema backup and a table uses CDC, we include an ALTER statement for its corresponding CDC log table (for more context, see issue scylladb/scylladb#18467 or commit scylladb/scylladb@f12edbdd95). There are two consequences for the user here: 1. If the log table had NOT been altered ever since it was created, the statement will miss the `tombstone_gc` property as if it couldn't be set for it at all. That's confusing! 2. If the log table HAD in fact been altered after its creation, the statement will include the `tombstone_gc` property. That's even more confusing (why was it not present the first time, but it is now?). 
The `tombstone_gc` property should always be set to avoid confusion and problematic edge cases in tests and to simply be consistent with how other schema entities work. The solution we employ is that we always set the property to the default value. That includes the case when we reattach the log table to the base; consider the following scenario: 1. Create a table with CDC enabled. 2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`. 3. Change the `tombstone_gc` property of the log table. 4. Reattach the log table to the base in the same way as in step 2. The expected result would be that the new value of `tombstone_gc` would be preserved after reattaching the log table. However, that's not what will happen. We decide to stay consistent with how other properties of a log table behave, and we reset them after every reattachment. We might change that in the future: see issue scylladb/scylladb#25523. Two reproducer tests of scylladb/scylladb#25187 are included in the changes. Backport: The problem is not critical, so it may not be necessary to backport the changes. That's to be discussed. Closes scylladb/scylladb#25521 * github.com:scylladb/scylladb: cdc: Set tombstone_gc when creating log table tombstone_gc: Add overload of get_default_tombstone_gc_mode tombstone_gc: Rename get_default_tombstonesonte_gc_mode	2025-09-02 10:20:11 +02:00
Tomasz Grabiec	a7f10b585e	Merge 'drop table: fix crash on drop table with concurrent cleanup' from Ferenc Szili Consider the following scenario: - A tablet is migrated away from a shard - The tablet cleanup stage closes the storage group's async_gate - A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate - Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash Fixes: #25706 This needs to be backported to all supported versions with tablets Closes scylladb/scylladb#25708 * github.com:scylladb/scylladb: test: reproducer and test for drop with concurrent cleanup truncate: check for closed storage group's gate in discard_sstables	2025-09-02 00:02:14 +02:00
Calle Wilund	21adfd8a60	test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage Allows testing using either local mock server (installed or using docker), or real GCP project (not tested as of writing this). v2: Try podman if docker unavailable v3: Ensure we check log output on fake-gcs, because when using podman, the published port will be connectible even though the actual server is not up yet. v4: Use ephemeral port forward in docker/podman to allow us running parallel instances. Also adjust credentials and port finding in test. v5: Re-ensure no parallel tests for this: We seem to time out in podman trying to fetch image for X parallel tests v6: Remove the ephemeral port stuff. Because of course this does not work with our podman-in-podman. Do brute-force port speculation instead. v7: Up timeout for server start to allow docker pull. v8: Fix string check error v9: Add explicit docker image version	2025-09-01 18:14:20 +00:00
Calle Wilund	5ead6ec420	proc-utils: Re-export waiting types from seastar Just to make directly accessible from wrapper type	2025-09-01 18:03:44 +00:00
Calle Wilund	8169327553	proc-utils: Inherit environment from current process In most cases, when launching a process from tests, we will want to inherit our own env. Add option (default true) to do so.	2025-09-01 18:03:44 +00:00
Calle Wilund	4a5b547a86	utils::gcp::object_storage: Add client for GCP object storage Adds a minimal client for GCP object storage operations: * Create buckets * Delete buckets * List bucket content * Copy/move bucket content * Delete bucket content * Upload bucket content * Download bucket content	2025-09-01 18:03:44 +00:00
Calle Wilund	8f54b709ce	utils::http: Add optional external credentials to dns_connection_factory init Also allow creating the object using an endpoint expression. Note: this moves code to the .cc file, because it introduces a few more lines, and I feel we have too much stuff in headers as is.	2025-09-01 18:03:44 +00:00
Calle Wilund	0e9e1f7738	utils::rest: Break out request wrapper and send logic Allows sharing some of the wrapping and logic outside the single-call object/routine paths, using it also with an external seastar::http::client, i.e. caching resources across several calls.	2025-09-01 18:03:44 +00:00
Calle Wilund	fe4ab7f7bf	encryption::gcp_host: Use shared gcp credentials + REST helpers Removes code in favour of transplanted shared util code.	2025-09-01 18:03:44 +00:00
Calle Wilund	2b7ad605b3	utils::gcp: Move/add gcp credentials management to shared file Copied from encryption::gcp_host. Light-weight impl of gcp credentials management.	2025-09-01 18:03:44 +00:00
Calle Wilund	f6d7c7e300	utils::rest::client: Add formatter for seastar::http::reply	2025-09-01 18:03:44 +00:00
Calle Wilund	cc1e659abd	utils::rest::client: Add helper routines for simple REST calls Packing headers and unpacking response to json. Usable for esp. gcp interaction.	2025-09-01 18:03:43 +00:00
Calle Wilund	886fcf1759	utils::http: Make shared system trust certificates public So other clients/factories can share.	2025-09-01 18:03:43 +00:00
Karol Nowacki	3086d15999	cql3: Fix crash on ANN OF query when TRACING ON is enabled Executing a vector search (SELECT with ANN OF ordering) query with `TRACING ON` enabled caused a node to crash due to a null pointer dereference. This occurred because a vector index does not have an associated view table, making its `_view_schema` member null. The implementation attempted to enable tracing on this null view schema, leading to the crash. The fix adds a null check for `_view_schema` before attempting to enable tracing on the view (index) table. A regression test is included to prevent this from happening again. Fixes: VECTOR-179 Closes scylladb/scylladb#25500	2025-09-01 17:26:54 +03:00
Avi Kivity	41880bc893	cql3: statement_restrictions: forbid querying a single-column inequality restriction on a multi-column restriction CQL supports multi-column inequality restrictions in the form (ck1, ck2, ck3) >= (:v1, :v2, :v3) This restriction shape is only allowed on clustering keys, and is translated into a partition_slice allowing the primary index to efficiently select the part of the partition that satisfies the restriction. The possible_lhs_values() function allows extracting single-column restrictions from this and similar tuple restrictions. For example, the multi-column restriction (ck1, ck2, ck3) = (:v1, :v2, :v3) implies that ck2 = :v2. If we have an index on ck2, and if we don't further have a restriction on the partition key, then it is advantageous to use the index to select rows, and then filter on ck1 and ck3 to satisfy the full restriction. For the inequality restriction, we can only infer a restriction on the first column due to lexicographical comparison. We can see that, given (ck1, ck2, ck3) >= (:v1, :v2, :v3) then ck1 >= :v1 ck2 = unbounded ck3 = unbounded and possible_lhs_values() indeed computes this. However, this is never used in practice, and it makes further refactoring difficult. If we want to convert a boolean factor of the where clause to a predicate on a column or tuple of columns, we cannot do so because we can actually generate two predicates: one on the tuple and one on the first column. Since it's not used, remove it. This code was first introduced in `d33053b841` ("cql3/restrictions: Add free functions over new classes") (search for "if (column_index_on_lhs > 0) {"). It does not directly correspond to pre-expression code. Closes scylladb/scylladb#25757	2025-09-01 17:21:26 +03:00
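The lexicographic argument in the entry above can be checked with a small sketch. Python tuples compare lexicographically, which serves here as a stand-in for the clustering-key comparison; the sample rows and the bound are made up for illustration.

```python
# Bound (:v1, :v2, :v3) and a few hypothetical clustering keys (ck1, ck2, ck3).
bound = (2, 5, 5)
rows = [(2, 5, 5), (3, 0, 0), (2, 6, -1), (2, 4, 9), (1, 9, 9)]

# Rows satisfying (ck1, ck2, ck3) >= (:v1, :v2, :v3).
matching = [row for row in rows if row >= bound]

# Every match satisfies ck1 >= :v1 ...
first_column_bounded = all(row[0] >= bound[0] for row in matching)
# ... but ck2 and ck3 can fall below :v2 and :v3, so they are unbounded.
ck2_below_bound = any(row[1] < bound[1] for row in matching)
ck3_below_bound = any(row[2] < bound[2] for row in matching)
```

The row `(3, 0, 0)` matches even though its second column is below `:v2`, which is why only the first column yields a usable single-column restriction.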
Artsiom Mishuta	5910ad3c6d	test.py: apply the nightly label on test_topology_recovery_basic This test is for the old gossip-based recovery procedure, which is an almost obsolete feature that won't change anymore. Closes scylladb/scylladb#25694	2025-09-01 14:16:29 +02:00
Emil Maskovsky	5dac4b38fb	test/gossiper: add reproducible test for race condition during node decommission This change introduces a targeted test that simulates the gossiper race condition observed during node decommissioning. The test delays gossip state application and host ID lookup to reliably reproduce the scenario where `gossiper::get_host_id()` is called on a removed endpoint, potentially triggering an abort in `apply_new_states`. There is a specific error injection added to widen the race window, in order to increase the likelihood of hitting the race condition. The error injection is designed to delay the application of gossip state updates, for the specific node that is being decommissioned. This should then result in the server abort in the gossiper. Refs: scylladb/scylladb#25621 Fixes: scylladb/scylladb#25721 Backport: The test is primarily for an issue found in 2025.1, so it needs to be backported to all the 2025.x branches. Closes scylladb/scylladb#25685	2025-09-01 13:59:47 +02:00
Ernest Zaslavsky	0e4292adb4	treewide: Add missing license boilerplate Add missing license boilerplate to include headers	2025-09-01 14:58:32 +03:00
Ernest Zaslavsky	19345e539f	treewide: Add missing `#pragma once` Add missing `#pragma once` to include headers	2025-09-01 14:58:21 +03:00
Petr Gusev	2e757d6de4	cas: pass timeout_if_partially_accepted := write to accept_proposal() Write requests cannot be safely retried if some replicas respond with accepts and others with rejects. In this case, the coordinator is uncertain about the outcome of the LWT: a subsequent LWT may either complete the Paxos round (if a quorum observed the accept) or overwrite it (if a quorum did not). If the original LWT was actually completed by later rounds and the coordinator retried it, the write could be applied twice, potentially overwriting effects of other LWTs that slipped in between. Read requests do not have this problem, so they can be safely retried. Before this commit, handler->accept_proposal was called with timeout_if_partially_accepted := true. This caused both read and write requests to throw an "uncertainty" timeout to the user in the case of the contention described above. After this commit, we throw an "uncertainty" timeout only for write requests, while read requests are instead retried in the loop in sp::cas. Closes scylladb/scylladb#25602	2025-09-01 14:31:04 +03:00
Pavel Emelyanov	840cdab627	api: Move /load and /metrics/load handlers code to column_family.cc Both handlers need database to proceed and thus need to be registered (and unregistered) in a group that captures database for its handlers. Once moved, the used get_cf_stats() method can be marked local to column_family.cc file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25671	2025-09-01 08:11:00 +02:00
Dawid Mędrek	fc50e9d0a4	test/perf: Require smp=1 in perf_cache_eviction Trying to run the test with more than one shard results in a failure when generating sharding metadata: ``` ERROR 2025-08-27 16:00:17,551 [shard 0:main] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /tmp/scylla-c9fa42fe/ks/cf-2938a030834e11f0a561ffa33feb022d/me-3gt6_12wh_1gifk2ijgeu1ovc1m5-big-Data.db). Aborting ``` Let's require that the test be run with a single shard. Closes scylladb/scylladb#25703	2025-09-01 08:59:35 +03:00
Nadav Har'El	6d1abc5b2c	utils/base64: fix misleading code and comment (no functional change) utils/base64.cc had some strange code with a strange comment in base64_begins_with(). The code had base.substr(operand.size() - 4, operand.size()) The comment claims that this is "last 4 bytes of base64-encoded string", but this comment is misleading - operand is typically shorter than base (this is the whole point of base64_begins_with()), so the real intention of the code is not to find the last 4 bytes of base, but rather the next four bytes after the (operand.size() - 4) which we already copied. These four bytes may need the full power of base64_decode_string() because they may or may not contain padding. But, if we really want the next 4 bytes, why pass operand.size() as the length of the substring? operand.size() is at least 4 (it's a multiple of 4, and if it's 0 we returned earlier), but it could be more. We don't need more, we just need 4. It's not really wrong to take more than 4 (so this patch doesn't fix any bug), but can be wasteful. So this code should be: base.substr(operand.size() - 4, 4) We already have in test/boost/alternator_unit_test.cc a test, test_base64_begins_with that takes encoded base64 strings up to 12 characters in length (corresponding to decoded strings up to 8 chars), and substrings from length 0 to the base string's length, and checks that test_base64_begins_with succeeds. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25712	2025-09-01 08:57:50 +03:00
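A minimal Python sketch of the prefix check described in the entry above, assuming both inputs are padded base64 strings whose lengths are multiples of 4. It mirrors the idea (compare the already-aligned head directly, decode only one 4-character group), not the actual C++ code in utils/base64.cc.

```python
import base64


def base64_begins_with(base: str, operand: str) -> bool:
    """Return True if decoded `base` starts with decoded `operand`."""
    if not operand:
        return True
    if len(base) < len(operand):
        return False
    head = len(operand) - 4
    # All groups before the operand's last one carry no padding, so the
    # encoded characters can be compared directly.
    if base[:head] != operand[:head]:
        return False
    # Only the next 4 characters of base are needed, i.e. substr(head, 4).
    base_tail = base64.b64decode(base[head:head + 4])
    operand_tail = base64.b64decode(operand[head:])
    return base_tail.startswith(operand_tail)
```

For example, the encoding of `b"hello world"` begins with the encoding of `b"hell"` in this sense, but not with that of `b"help"`.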
Andrei Chekun	e55c8a9936	test.py: modify run to use different junit output filenames Currently, run executes pytest twice without modifying the path of the JUnit XML report. As a result, the second pytest execution overwrites the report. This PR fixes the issue so that both reports are stored. Closes scylladb/scylladb#25726	2025-09-01 08:56:48 +03:00
Ernest Zaslavsky	05154e131a	cleanup: Add missing `#pragma once` Add missing `#pragma once` to include header Closes scylladb/scylladb#25761	2025-09-01 06:41:57 +03:00
Botond Dénes	fbff8d3b2d	Merge 'vector_store_client: disable Nagle's algorithm on the http client' from Pawel Pery Nagle’s algorithm and Delayed ACK’s algorithm are enabled by default on sockets in Linux. As a result we can experience 40ms latency on simply waiting for ACK on the client side. Disabling the Nagle’s algorithm (using TCP_NODELAY) should fix the issue (client won’t wait 40ms for ACKs). This change sets `TCP_NODELAY` on every socket created by the `http_client`. Checking for dead peers or network is helpful in maintaining a lifetime of the http client. This change also sets TCP_KEEPALIVE option on the http client's socket. Fixes: VECTOR-169 Closes scylladb/scylladb#25401 * github.com:scylladb/scylladb: vector_store_client: set keepalive for the http client's socket vector_store_client: disable Nagle's algorithm on the http client	2025-09-01 06:26:06 +03:00
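The two socket options mentioned in the entry above can be illustrated with Python's standard `socket` module. This is a generic sketch, not the vector_store_client code; the helper name is hypothetical, and `SO_KEEPALIVE` is assumed to be the keepalive option the message refers to.

```python
import socket


def configure_low_latency(sock: socket.socket) -> None:
    """Disable Nagle's algorithm and enable TCP keepalive probes."""
    # TCP_NODELAY: send small writes immediately instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # SO_KEEPALIVE: let the kernel probe idle connections for dead peers.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
configure_low_latency(sock)
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Both options are set before connecting, so they apply from the first packet of the connection.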
Jenkins Promoter	619b4102bd	Update pgo profiles - x86_64	2025-09-01 05:08:56 +03:00
Jenkins Promoter	783f866bd3	Update pgo profiles - aarch64	2025-09-01 05:05:17 +03:00
Dario Mirovic	8e994b3890	test/cqlpy: add protocol exception tests Add protocol exception tests that check errors and exceptions. `test_process_startup_invalid_string_map`: `STARTUP` (0x01) with declared map count, but missing entries - `read_string_map` out-of-range. `test_process_query_internal_malformed_query`: `QUERY` (0x07) long string declared larger than available bytes - `read_long_string_view`. `test_process_query_internal_fail_read_options`: `QUERY` (0x07) with `PAGE_SIZE` flag, but truncated page_size - `read_options` path. `test_process_prepare_malformed_query`: `PREPARE` (0x09) long string declared larger than available bytes - `read_long_string_view` in prepare. `test_process_execute_internal_malformed_cache_key`: `EXECUTE` (0x0A) cache key short bytes declared larger than provided bytes - `read_short_bytes`. `test_process_register_malformed_string_list`: `REGISTER` (0x0B) string list with truncated element - `read_string_list`/`read_string`. Each test asserts an `ERROR` frame is returned and `protocol_error` metrics increase, without causing C++ exceptions. Refs: #24567	2025-08-31 23:40:03 +02:00
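A malformed `QUERY` frame of the kind these tests send might be built as sketched below. The 9-byte header layout (version, flags, stream, opcode, body length) and the `[long string]` encoding follow the CQL native protocol v4; the helper names are hypothetical and are not the ones used in `test_protocol_exceptions.py`.

```python
import struct

OPCODE_QUERY = 0x07


def build_frame(opcode: int, body: bytes, version: int = 0x04,
                flags: int = 0, stream: int = 1) -> bytes:
    """Prefix `body` with a v4 request header (big-endian, 9 bytes)."""
    header = struct.pack(">BBhBi", version, flags, stream, opcode, len(body))
    return header + body


def malformed_long_string(text: str, declared_extra: int = 16) -> bytes:
    """Encode a [long string] whose declared length exceeds the real one."""
    data = text.encode("utf-8")
    return struct.pack(">i", len(data) + declared_extra) + data


frame = build_frame(OPCODE_QUERY, malformed_long_string("SELECT 1"))
```

A server that validates lengths should answer such a frame with an `ERROR` frame, since the declared string length runs past the end of the body.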
Dario Mirovic	84e6979adf	test/cqlpy: `test_protocol_exceptions.py` refactor message frame building Frame building is repetitive and increases verbosity, reducing code readability. This patch solves it by extracting common functionality of frame building into `_build_frame`. Also, helpers `_send_frame` and `_recv_frame` are introduced. While `_recv_frame` is not really useful, it goes well in pair with `_send_frame`. Refs: #24567	2025-08-31 23:40:01 +02:00
Dario Mirovic	19c610d9f7	test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code The code that measures errors and exceptions in `test_protocol_exceptions.py` tests is repetitive. This patch refactors common functionality in a separate `_test_impl` function, improving readability. Refs: #24567	2025-08-31 23:39:58 +02:00
Avi Kivity	dfc7957a73	Merge 'test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints' from Benny Halevy Following up on `6129411a5e`, improve test_vnode_keyspace_describe_ring by verifying that the endpoints listed by describe_ring match those returned by the `natural_endpoints` api (for random tokens). The latter are calculated using an independent code path directly from the effective_replication_map. * test exists currently only on master, no backport required Closes scylladb/scylladb#25610 * github.com:scylladb/scylladb: test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints test/pylib/rest_client: add natural_endpoints function	2025-08-31 20:36:15 +03:00
Avi Kivity	bae66cc0d8	Merge 'types: add byte-comparable format support for collections' from Lakshmi Narayanan Sreethar This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types. This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md Refs https://github.com/scylladb/scylladb/issues/19407 New feature - backport not required. Closes scylladb/scylladb#25603 * github.com:scylladb/scylladb: types/comparable_bytes: add compatibility testcases for collection types types/comparable_bytes: update compatibility testcase to support collection types types/comparable_bytes: support empty type types/comparable_bytes: support reversed types types/comparable_bytes: support vector cql3 type types/comparable_bytes: support tuple and UDT cql3 type types/comparable_bytes: support map cql3 type types/comparable_bytes: support set and list cql3 types types/comparable_bytes: introduce encode/decode_component types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes	2025-08-31 15:53:27 +03:00
Avi Kivity	600349e29a	Merge 'tasks: return task::impl from make_and_start_task ' from Aleksandra Martyniuk Currently, make_and_start_task returns a pointer to task_manager::task that hides the implementation details. If we need to access the implementation (e.g. because we want a task to "return" a value), we need to make and start task step by step openly. Return task_manager::task::impl from make_and_start_task. Use it where possible. Fixes: https://github.com/scylladb/scylladb/issues/22146. Optimization; no backport Closes scylladb/scylladb#25743 * github.com:scylladb/scylladb: tasks: return task::impl from make_and_start_task compaction: use current_task_type repair: add new param to tablet_repair_task_impl repair: add new params to shard_repair_task_impl repair: pass argument by value	2025-08-31 15:44:37 +03:00
Nadav Har'El	ff91027eac	utils, alternator: fix detection of invalid base-64 This patch fixes an error-path bug in the base-64 decoding code in utils/base64.cc, which among other things is used in Alternator to decode blobs in JSON requests. The base-64 decoding code has a lookup table, which was wrongly sized 255 bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF) was included in an invalid base-64 string, instead of detecting that this is an invalid byte (since the only valid bytes in a base-64 string are A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a nonsense 6-bit part, or even crash on an out-of-bounds read. Besides the trivial fix, this patch also includes a reproducing test, which tries to write a blob as a supposedly base-64 encoded string with a 0xFF byte in it. The test fails before this patch (the write succeeds, unexpectedly), and passes after this patch (the write fails as expected). The test also passes on DynamoDB. Fixes #25701 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25705	2025-08-31 15:38:01 +03:00
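The table-size issue in the entry above can be illustrated in Python. This is a sketch of the general technique (a decode table indexed by byte value must have 256 entries), not the code in utils/base64.cc; all names are hypothetical.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
INVALID = -1


def make_decode_table(size: int = 256) -> list:
    """Map each byte value to its 6-bit value, or INVALID."""
    table = [INVALID] * size
    for value, char in enumerate(ALPHABET):
        table[ord(char)] = value
    return table


DECODE_TABLE = make_decode_table()


def is_valid_base64_byte(byte: int) -> bool:
    """True for A-Z, a-z, 0-9, '+', '/' and the '=' padding byte."""
    return byte == ord("=") or DECODE_TABLE[byte] != INVALID
```

With a 255-entry table, looking up byte 0xFF would index past the end: Python raises `IndexError`, whereas the C++ equivalent is an out-of-bounds read.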
Avi Kivity	1f4c9b1528	Merge 'system_keyspace: add peers cache to get_ip_from_peers_table' from Petr Gusev The gossiper can call `storage_service::on_change` frequently (see scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues. This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL. This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620. Fixes scylladb/scylladb#25660 backport: this patch needs to be backported to all supported versions (2025.1/2/3). Closes scylladb/scylladb#25658 * github.com:scylladb/scylladb: storage_service: move get_host_id_to_ip_map to system_keyspace system_keyspace: use peers cache in get_ip_from_peers_table storage_service: move get_ip_from_peers_table to system_keyspace	2025-08-31 15:34:35 +03:00
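The short-lived cache idea in the entry above could look roughly like this. It is a generic time-to-live sketch with hypothetical names and an injectable clock, not the `system_keyspace` implementation.

```python
import time


class ShortLivedCache:
    """Cache the result of `loader` for at most `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Reload from the backing store (e.g. a host_id -> ip mapping).
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

A short TTL bounds how long a direct update to the backing table can go unnoticed, while frequent callers stop hitting storage on every lookup.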
Piotr Wieczorek	5add43e15c	alternator: streams: Address minor incompatibilities with DynamoDB in GetRecords response. This commit adds missing fields to GetRecords responses: `awsRegion` and `eventVersion`. We also considered changing `eventSource` from `scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield inside the `dynamodb` field. We set `awsRegion` to the datacenter's name of the node that received the request. This is in line with the AWS documentation, except that Scylla has no direct equivalent of a region, so we use the datacenter's name, which is analogous to DynamoDB's concept of region. The field `eventVersion` determines the structure of a Record. It is updated whenever the structure changes. We think that adding a field `userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla doesn't support this field (#11523), hence we use the older 1.0 version. We have decided to leave `eventSource` as is, since it's easy to modify it in case of problems to `aws:dynamodb` used by DynamoDB. Not setting `SizeBytes` subfield inside the `dynamodb` field was dictated by the lack of apparent use cases. The documentation is unclear about how `SizeBytes` is calculated and after experimenting a little bit, I haven't found an obvious pattern. Fixes: #6931 Closes scylladb/scylladb#24903	2025-08-31 14:55:47 +03:00
Avi Kivity	bf9a963582	utils: mark crc barrett tables const They're marked constinit, but constinit does not imply const. Since they're not supposed to be modified, mark them const too. Closes scylladb/scylladb#25539	2025-08-31 11:37:39 +03:00
Avi Kivity	bc5773f777	Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational and recoverable even under extreme conditions. To achieve this, the following proactive measures are implemented: - reject writes - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes - applicable to: user tables, views, CDC log, audit, cql tracing - stop running compactions/repairs and prevent new ones from starting - reject incoming tablet migrations The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches the critical level (default: 98%) and disabled when the utilization drops below the threshold. Apart from that, the series adds tests that require mounted volumes to simulate out of space. The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`. When not provided, tests are skipped. Test scenarios: 1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization 2. Perform an operation that would take the nodes over 100% 3. The nodes should not exceed the critical disk utilization (98% by default) 4. Scale out the cluster by adding one node per rack 5. Retry or wait for the operation from step 2 The operation is: writing data, running compactions, building materialized views, running repair, migrating tablets (caused by RF change, decommission). The test is successful if no nodes run out of space, the operation from step 2 is aborted/paused/timed out and the operation from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency: Read path (before) ``` instructions_per_op: mean= 39661.51 standard-deviation=34.53 median= 39655.39 median-absolute-deviation=23.33 maximum=39708.71 minimum=39622.61 ``` Read path (after) ``` instructions_per_op: mean= 39691.68 standard-deviation=34.54 median= 39683.14 median-absolute-deviation=11.94 maximum=39749.32 minimum=39656.63 ``` Write path (before): ``` instructions_per_op: mean= 50942.86 standard-deviation=97.69 median= 50974.11 median-absolute-deviation=34.25 maximum=51019.23 minimum=50771.60 ``` Write path (after): ``` instructions_per_op: mean= 51000.15 standard-deviation=115.04 median= 51043.93 median-absolute-deviation=52.19 maximum=51065.81 minimum=50795.00 ``` Fixes: https://github.com/scylladb/scylladb/issues/14067 Refs: https://github.com/scylladb/scylladb/issues/2871 No backport, as it is a new feature. Closes scylladb/scylladb#23917 * github.com:scylladb/scylladb: tests/cluster: Add new storage tests test/scylla_cluster: Override workdir when passed via cmdline streaming: Reject incoming migrations storage_service: extend locator::load_stats to collect per-node critical disk utilization flag repair_service: Add a facility to disable the service compaction_manager: Subscribe to out of space controller compaction_manager: Replace enabled/disabled states with running state database: Add critical_disk_utilization mode database can be moved to disk_space_monitor: add subscription API for threshold-based disk space monitoring docs: Add feature documentation config: Add critical_disk_utilization_level option replica/exceptions: Add a new custom replica exception	2025-08-30 18:47:57 +03:00
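The threshold behaviour described in the entry above (protective mode on at the critical level, off again below it) can be sketched as follows. The class, exception, and method names are hypothetical; only the 98% default comes from the message.

```python
class CriticalDiskUtilization(Exception):
    """Raised when a write is rejected because the disk is nearly full."""


class DiskSpaceGuard:
    def __init__(self, critical_level: float = 0.98):
        self.critical_level = critical_level
        self.critical = False

    def update(self, used_bytes: int, total_bytes: int) -> None:
        # Enter critical mode at or above the threshold, leave it below.
        self.critical = used_bytes / total_bytes >= self.critical_level

    def check_write(self) -> None:
        if self.critical:
            raise CriticalDiskUtilization("write rejected: disk utilization critical")
```

A caller would invoke `update()` from a periodic disk-space sample and `check_write()` on the write path, so rejected writes fail fast instead of filling the disk.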
Petr Gusev	898531fe7c	client_state: decoroutinize check_internal_table_permissions This function is on a hot path, better avoid allocating coroutine frames. Fixes scylladb/scylladb#25501 Closes scylladb/scylladb#25689	2025-08-30 18:46:54 +03:00
Avi Kivity	5c4a8ee134	Update seastar submodule * seastar 0a90f7945...c2d989333 (7): > Add missing `#pragma once` to response_parser.rl > simple-stream: avoid memcpy calls in fragmented streams for constant sizes > reactor: Move stopping activity out of main loop > Add sequential buffer size options to IOTune > disable exception interception when ASAN enabled > file, io_queue: Drop maybe_priority_class_ref{} from internal calls > reactor: Equip make_task() and lambda_task with concepts Closes scylladb/scylladb#25737	2025-08-30 14:53:34 +03:00
Calle Wilund	cc9eb321a1	commitlog: Ensure segment deletion is re-entrant Fixes #25709 If we have large allocations, spanning more than one segment, and the internal segment references from lead to secondary are the only thing keeping a segment alive, the implicit drop in discard_unused_segments and orphan_all can cause a recursive call to discard_unused_segments, which in turn can lead to vector corruption/crash, or even double free of segment (iterator confusion). Need to separate the modification of the vector (_segments) from actual releasing of objects. Using temporaries is the easiest solution. To further reduce recursion, we can also do an early clear of segment dependencies in callbacks from segment release (cf release). Closes scylladb/scylladb#25719	2025-08-30 08:24:57 +02:00
Piotr Dulikowski	7ccb50514d	Merge 'Introduce view building coordinator' from Michał Jadwiszczak This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views. The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table. The coordinator builds one base table at a time and it can choose another when all views of the currently processed base table are built. The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results. This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch. The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation. If the operation fails at any moment, aborted tasks are rolled back. The view building coordinator can also handle staging sstables using process_staging view building tasks. 
We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149). For detailed description check: `docs/dev/view-building-coordinator.md` Fixes https://github.com/scylladb/scylladb/issues/22288 Fixes https://github.com/scylladb/scylladb/issues/19149 Fixes https://github.com/scylladb/scylladb/issues/21564 Fixes https://github.com/scylladb/scylladb/issues/17603 Fixes https://github.com/scylladb/scylladb/issues/22586 Fixes https://github.com/scylladb/scylladb/issues/18826 Fixes https://github.com/scylladb/scylladb/issues/23930 --- This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942 Closes scylladb/scylladb#23760 * github.com:scylladb/scylladb: test/cluster: add view build status tests test/cluster: add view building coordinator tests utils/error_injection: allow to abort `injection_handler::wait_for_message()` test: adjust existing tests utils/error_injection: add injection with `sleep_abortable()` db/view/view_builder: ignore `no_such_keyspace` exception docs/dev: add view building coordinator documentation db/view/view_building_worker: work on `process_staging` tasks db/view/view_building_worker: register staging sstable to view building coordinator when needed db/view/view_building_worker: discover staging sstables db/view/view_building_worker: add method to register staging sstable db/view/view_update_generator: add method to process staging sstables instantly db/view/view_update_generator: extract generating updates from staging sstables to a method db/view/view_update_generator: ignore tablet-based sstables db/view/view_building_coordinator: update view build status on node join/left db/view/view_building_coordinator: handle tablet operations db/view: add view building task mutation builder service/topology_coordinator: run view building coordinator db/view: introduce `view_building_coordinator` 
db/view/view_building_worker: update built views locally db/view: introduce `view_building_worker` db/view: extract common view building functionalities db/view: prepare to create abstract `view_consumer` message/messaging_service: add `work_on_view_building_tasks` RPC service/topology_coordinator: make `term_changed_error` public db/schema_tables: create/cleanup tasks when an index is created/dropped service/migration_manager: cleanup view building state on drop keyspace service/migration_manager: cleanup view building state on drop view service/migration_manager: create view building tasks on create view test/boost: enable proxy remote in some tests service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` service/migration_manager: coroutinize `prepare_new_view_announcement()` service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` service: reload `view_building_state_machine` on group0 apply() service/vb_coordinator: add currently processing base db/system_keyspace: move `get_scylla_local_mutation()` up db/system_keyspace: add `view_building_tasks` table db/view: add view_building_state and views_state db/system_keyspace: add method to get view build status map db/view: extract `system.view_build_status_v2` cql statements to system_keyspace db/system_keyspace: move `internal_system_query_state()` function earlier db/view: ignore tablet-based views in `view_builder` gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-29 17:28:44 +02:00
Aleksandra Martyniuk	7fe1ad1f63	tasks: return task::impl from make_and_start_task Currently, make_and_start_task returns a pointer to task_manager::task that hides the implementation details. If we need to access the implementation (e.g. because we want a task to "return" a value), we need to make and start task step by step openly. Return task_manager::task::impl from make_and_start_task. Use it where possible. Fixes: https://github.com/scylladb/scylladb/issues/22146.	2025-08-29 17:12:07 +02:00
Aleksandra Martyniuk	0844a057d1	compaction: use current_task_type	2025-08-29 17:08:00 +02:00
Aleksandra Martyniuk	33a547e740	test: check that repair with outdated session_id fails	2025-08-29 17:00:48 +02:00
Aleksandra Martyniuk	8f967cde5c	service: pass current session_id to repair rpc Currently, in repair_tablet we retrieve session_id from tablet map (and throw if it isn't specified). In case of topology coordinator failover, we may end up in a situation where a node runs outdated repair, treating session of a different operation as the repair's session: - topology coordinator starts repair transition (A); - topology coordinator sends tablet repair rpc to node1; - topology coordinator is separated from the cluster; - new topology coordinator is elected; - new topology coordinator sees waiting repair request (A_2) and executes it; - new repair of the same tablet is requested (B); - new topology coordinator starts repair transition (B); - new topology coordinator sends tablet repair rpc to node2; - node2 starts repair (B) as repair master; - node1 starts repair (A), checks the current session (B), proceeds with repair (B) as repair master. Send current session_id in repair_tablet rpc. If this session_id and session id got from tablet map don't match, an exception is thrown.	2025-08-29 16:46:52 +02:00
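The session check described in the commit above can be illustrated with a toy model. This is only a sketch: `check_repair_session` and `StaleSessionError` are hypothetical names, not identifiers from the ScyllaDB code base, and the real check runs inside the repair RPC handler.

```python
class StaleSessionError(Exception):
    """Hypothetical error type used only in this sketch."""


def check_repair_session(rpc_session_id, tablet_map_session_id):
    # The tablet map must carry a session for the ongoing transition.
    if tablet_map_session_id is None:
        raise StaleSessionError("no session in tablet map")
    # Reject requests that belong to a different (older) repair operation.
    if rpc_session_id != tablet_map_session_id:
        raise StaleSessionError(
            f"session mismatch: rpc={rpc_session_id} map={tablet_map_session_id}"
        )
    return tablet_map_session_id


# node1 receives the delayed RPC for repair A while the map already holds B.
try:
    check_repair_session("session-A", "session-B")
    outdated_rejected = False
except StaleSessionError:
    outdated_rejected = True

current_accepted = check_repair_session("session-B", "session-B")
```

With the sender's session carried in the request, the outdated repair (A) fails instead of silently running under session (B).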
Łukasz Paszkowski	e34deea50e	tests/cluster: Add new storage tests The storage submodule contains tests that require mounted volumes to be executed. The volumes are created automatically with the `volumes_factory` fixture. The tests in this suite are executed with the custom launcher `unshare -mr pytest` Test scenarios (when one node reaches critical disk utilization): 1. Reject user table writes 2. Disable/enable compaction 3. Reject split compactions 4. New split compactions not triggered 5. Abort tablet repair 6. Disable/enable incoming tablet migrations 7. Restart a node while a tablet split is triggered	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	4bb5696a5d	test/scylla_cluster: Override workdir when passed via cmdline Currently, workdir is set in ScyllaCluster constructor and it does not take into account that the value could be overridden via cmdline arguments. When this happens, some data (logs, configs) is stored under one path and other data is stored under a different one. The patch allows overriding the value when passed via cmdline arguments, so that all files are stored under the same path.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	7cfedb1214	streaming: Reject incoming migrations When a replica operates in the critical disk utilization mode, all incoming migrations are rejected by rejecting an incoming sstable file. In the topology_coordinator, the rejected tablet is moved into the cleanup_target state in order to revert the migration. Otherwise, a retry happens and the cluster stays in the tablet_migration transition state, preventing any other topology changes, e.g. scaling out, from happening. Once the tablet migration is rejected, the load balancer will schedule a new migration.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	54201960e6	storage_service: extend locator::load_stats to collect per-node critical disk utilization flag This commit extends the TABLE_LOAD_STATS RPC with information on whether a node operates in the critical disk utilization mode. This information will be needed to distinguish between the reasons why a table migration/repair was interrupted.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	9809800aa8	repair_service: Add a facility to disable the service Repair service currently has two functions: stop() and shutdown() that stop all ongoing repairs and prevent any further repairs from being started. It is possible to stop the repair_service once. Once stopped, it cannot be restarted. We would like, however, to enable / disable the repair service many times. Similarly to compaction_manager, the repair service provides two new functions: - drain() - abort all ongoing local repair tasks and disable the service, i.e. no new local task will be scheduled and data received from the repair master is rejected. It is still possible, though, to schedule a global repair request - enable() - enable the service By default, the repair service is enabled immediately once started. For tablet-based keyspaces, the new facility prevents tablets from being repaired. Whenever the repair_service is disabled and the request to repair a tablet arrives, an exception is returned. Once the exception is thrown, the tablet is moved into the end_repair state and the operation will be retried later. Hence, disabling the service does not fail the global tablet repair request.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	9539e80e54	compaction_manager: Subscribe to out of space controller	2025-08-29 14:56:07 +02:00
Aleksandra Martyniuk	f3b43b6384	repair: add new param to tablet_repair_task_impl Currently, sched_info is set immediately after tablet_repair_task_impl is created. Pass this param to constructor instead. It's a preparation for the following changes.	2025-08-29 14:37:00 +02:00
Aleksandra Martyniuk	57b47e282e	repair: add new params to shard_repair_task_impl Currently, neighbors and small_table_optimization_ranges_reduced_factor are set immediately after shard_repair_task_impl is created. Pass these params to constructor instead. It's a preparation for following changes.	2025-08-29 14:27:00 +02:00
Aleksandra Martyniuk	6a0d8728de	repair: pass argument by value shard_repair_task_impl constructor gets some of its arguments by const reference. Due to that those arguments are copied when they could be moved. Get shard_repair_task_impl constructor arguments by value. Use std::move where possible.	2025-08-29 14:24:47 +02:00
Dawid Mędrek	90a2e0d1cc	cdc/generation: Delete copy constructors of topology_description The object might be quite big and lead to reactor stalls. Using its copy constructor is asking for trouble, so let's explicitly delete it.	2025-08-29 13:54:16 +02:00
Dawid Mędrek	508e00319b	cdc/generation: Clone topology_description asynchronously An instance of `cdc::topology_description` can be quite big. The vector it consists of stores as many `token_range_description`s as there are vnodes, and the size of each `token_range_description` is O(#shards). Because of that, copying an instance of the type can lead to reactor stalls. To prevent that, we introduce an asynchronous function copying the contents of the object. Reactor stalls were detected in the call to `map_reduce` in `generation_service::legacy_do_handle_cdc_generation`, so let's start using the new function there. A similar scenario occurs in `generation_service::handle_cdc_generation`, so we modify it too. Unfortunately, it doesn't seem viable to provide a reproducer of said problem. Fixes scylladb/scylladb#24522	2025-08-29 13:54:00 +02:00
Łukasz Paszkowski	40c40be8a6	compaction_manager: Replace enabled/disabled states with running state Using a single state variable to keep track of whether compaction manager is enabled/disabled is insufficient, as multiple services may independently request compactions to be disabled. To address this, a counter is introduced to record how many times the compaction manager has been drained. The manager is considered enabled only when this counter reaches zero. With the counter in place, the enabled and disabled states become obsolete, so they are replaced with a single running state.	2025-08-29 13:47:01 +02:00
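A minimal sketch of the drain-counter idea from the commit above, assuming a simplified model in which every `drain()` is paired with one `enable()`. The class name `CompactionGate` and its methods are hypothetical and only mirror the description, not the real `compaction_manager` interface.

```python
class CompactionGate:
    """Toy model of a drain counter; not the real compaction_manager."""

    def __init__(self):
        self._drain_count = 0  # number of outstanding drain requests

    def drain(self):
        # Each requester that wants compaction stopped bumps the counter.
        self._drain_count += 1

    def enable(self):
        if self._drain_count == 0:
            raise RuntimeError("enable() without a matching drain()")
        self._drain_count -= 1

    @property
    def enabled(self):
        # Enabled only when nobody is asking for compaction to be stopped.
        return self._drain_count == 0


gate = CompactionGate()
gate.drain()   # e.g. requested because of critical disk utilization
gate.drain()   # e.g. requested independently by another service
gate.enable()  # the first requester is done
still_disabled = not gate.enabled
gate.enable()  # the second requester is done
```

A plain boolean would have re-enabled compaction after the first `enable()` call, even though another requester still needed it stopped.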
Łukasz Paszkowski	3d03b88719	database: Add critical_disk_utilization mode database can be moved to When database operates in the critical disk utilization mode, all mutation writes (including inserts, updates, deletes, counter updates, hints, read+repair, lwt writes) to user tables and other tables associated with them, like views, CDC log, audit, are rejected, with a clear error exception returned. The mode is meant to be used with the disk space monitor in order to prevent any user writes when a node's disk utilization is too high.	2025-08-29 13:46:45 +02:00
Dawid Pawlik	a70086c781	create_index_statement: rename `validator` to `custom_index_factory` The change is motivated by the fact that indeed the result of `get_custom_class_factory` is a `custom_index_factory`. The name `validator` was a bit misleading as it does not validate anything by itself. Furthermore if we wanted to use the custom index produced by the factory in other operations than validate, the name feels really off.	2025-08-29 10:49:15 +02:00
Dawid Pawlik	873d7dba5c	custom index: rename `custom_index_option_name` Renamed `custom_index_option_name` to `custom_class_option_name` as the old name was a bit misleading since we refactored our model of custom indexes to be index class reliant.	2025-08-29 10:49:15 +02:00
Dawid Pawlik	18e4b9d989	vector_index: rename `supported_options` to `vector_index_options` There are a few types of index options abstraction in the code. One is `raw_options`, which indicates the options provided by the user via CQL. Another is `options`, which includes the real index options after correction checks and addition of system-set options. I believe we do not need another abstraction with an undescriptive name. This patch adds a little neatness, describing what the developer should understand by looking at the `supported_options`. These options are only provided for the vector index to set up the external index properly with parameters strongly related to Vector Search.	2025-08-29 10:47:02 +02:00
Lakshmi Narayanan Sreethar	ce0c29e024	types/comparable_bytes: add compatibility testcases for collection types This patch adds compatibility testcases for the following cql3 types : set, list, map, tuple, vector and reversed types. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	4547f6f188	types/comparable_bytes: update compatibility testcase to support collection types The `abstract_type::from_string()` method used to parse the input data doesn't support collections yet. So the collection testdata will be passed as JSON strings to the testcase. This patch updates the testcase to adapt to this workaround. Also, extended the testcase to verify that Scylla's implementation can successfully decode the byte comparable output encoded by Cassandra. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	0997b3533c	types/comparable_bytes: support empty type Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	b799101a09	types/comparable_bytes: support reversed types A reversed type is first encoded using the underlying type and then all the bits are flipped to ensure that the lexicographical sort order is reversed. During decode, the bytes are flipped first and then decoded using the underlying type. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
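The bit-flipping trick for reversed types can be sketched as follows. The underlying encoding here is an assumption chosen for illustration (a fixed-width big-endian unsigned 32-bit integer); the helper names are hypothetical and do not come from the patch.

```python
import struct


def encode_u32(value: int) -> bytes:
    # Fixed-width big-endian: byte-wise order matches numeric order.
    return struct.pack(">I", value)


def decode_u32(data: bytes) -> int:
    return struct.unpack(">I", data)[0]


def flip(data: bytes) -> bytes:
    # Invert every bit; applying it twice restores the input.
    return bytes(b ^ 0xFF for b in data)


def encode_reversed(value: int) -> bytes:
    # Encode with the underlying type first, then flip all bits.
    return flip(encode_u32(value))


def decode_reversed(data: bytes) -> int:
    # Flip back first, then decode with the underlying type.
    return decode_u32(flip(data))


values = [7, 300, 0, 65536]
by_encoded = sorted(values, key=encode_reversed)
round_trip = [decode_reversed(encode_reversed(v)) for v in values]
```

Sorting by the flipped bytes yields the values in descending order, which is the reversed sort order the commit describes.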
Lakshmi Narayanan Sreethar	6c2a3e2c51	types/comparable_bytes: support vector cql3 type The CQL vector type encoding is similar to the lists, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	1ccfe522f1	types/comparable_bytes: support tuple and UDT cql3 type The CQL tuple and UDT types share the same internal implementation and therefore use the same byte comparable encoding. The encoding is similar to lists, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. TODO: Add duplicate test items to maps, lists and sets For maps, add more entries that share keys ex map1 : key1 : value1, key2 : value2 map2 : key1 : value4 map3 : key2 : value5 etc Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	ca38c15a97	types/comparable_bytes: support map cql3 type The CQL map type is encoded as a sequence of key-value pairs. Each key and each value is individually prefixed with a component marker, and the sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	4d5e5f0c84	types/comparable_bytes: support set and list cql3 types The CQL set and list types are encoded as a sequence of elements, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	8e46e8be01	types/comparable_bytes: introduce encode/decode_component The components of a collection, such as an element from a list, set, or vector; a key or value from a map; or a field from a tuple, share the same encode and decode logic. During encode, the component is transformed into the byte comparable format and is prefixed with the `NEXT_COMPONENT` marker. During decode, the component is transformed back into its serialized form and is prefixed with the serialized size. A null component is encoded as a single `NEXT_COMPONENT_NULL` marker and during decode, a `-1` is written to the serialized output. This commit introduces a few helper methods that implement the above-mentioned encode and decode logic. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:21 +05:30
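The component/terminator scheme used by the collection commits above can be sketched for a list of integers. Everything below is illustrative: the numeric marker values are assumptions, the element encoding is a fixed-width big-endian unsigned integer picked for simplicity, and the decoder returns Python values rather than the size-prefixed serialized form (with `-1` for nulls) that the commit describes.

```python
import struct

# Assumed marker values, for illustration only. What matters for ordering
# is TERMINATOR < NEXT_COMPONENT_NULL < NEXT_COMPONENT.
NEXT_COMPONENT = 0x40
NEXT_COMPONENT_NULL = 0x3E
TERMINATOR = 0x38


def encode_component(value):
    # A null component is a single marker byte with no payload.
    if value is None:
        return bytes([NEXT_COMPONENT_NULL])
    return bytes([NEXT_COMPONENT]) + struct.pack(">I", value)


def encode_list(items):
    # Each element gets a marker prefix; the sequence ends with a terminator.
    return b"".join(encode_component(v) for v in items) + bytes([TERMINATOR])


def decode_list(data):
    items, pos = [], 0
    while data[pos] != TERMINATOR:
        marker = data[pos]
        pos += 1
        if marker == NEXT_COMPONENT_NULL:
            items.append(None)
        elif marker == NEXT_COMPONENT:
            items.append(struct.unpack_from(">I", data, pos)[0])
            pos += 4
        else:
            raise ValueError(f"unexpected marker {marker:#x}")
    return items


lists = [[1, 2], [1], [1, 2, 3], [2], []]
by_encoded = sorted(lists, key=encode_list)
round_trip = decode_list(encode_list([5, None, 9]))
```

Because the terminator compares lower than the component marker, a list that is a prefix of another sorts before it, so byte-wise comparison of the encodings agrees with element-wise comparison of the lists.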
Lakshmi Narayanan Sreethar	47e88be6e0	types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes Added helper functions to_comparable_bytes() and from_comparable_bytes() to let collection encode/decode methods invoke encode/decode of the underlying types. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:09 +05:30
Dario Mirovic	8120709231	transport: replace `make_frame` throw with return result `cql_transport::response::make_frame` used to throw `protocol_exception`. With this change it will return `result_with_exception_ptr<sstring>` instead. Code changes are propagated to `cql_transport::cql_server::response::make_message` and from there to `cql_transport::cql_server::connection::write_response`. `write_response` continuation calling `make_message` used to transform the exception from `make_message` to an exception future, and now the logic stays the same, just explicitly stated at this code layer, so the behavior is not changed. Refs: #24567	2025-08-28 23:33:33 +02:00
Dario Mirovic	ba178f4c85	cql3: remove throwing `protocol_exception` Remove throwing `protocol_exception` in `cql3/query_options.cc` in function `cql3::query_options::check_serial_consistency` as part of an ongoing effort to remove throwing `protocol_exception`. This change only affects code local to the `cql3` module. Refs: #24567	2025-08-28 23:33:15 +02:00
Dario Mirovic	fc123f865e	transport: replace throw in validate_utf8 with result_with_exception_ptr return As part of the effort to replace `protocol_exception` throws, `validate_utf8` from `cql_transport::request_reader` throw is replaced with returning `utils::result_with_exception_ptr`. This change affects only the three places it is called from in the same file `transport/request.hh`. Refs: #24567	2025-08-28 23:32:28 +02:00
Dario Mirovic	51995af258	transport: replace throwing protocol_exception with returns Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance. Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_exception_ptr`. This change is then propagated to the callers. The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_exception_ptr_throw_policy`. This means that the behavior of commitlog module stays the same. transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_exception_ptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved. Fixes: #24567	2025-08-28 23:31:36 +02:00
Dario Mirovic	f01efd822e	utils: add result_with_exception_ptr Add `result_with_exception_ptr` result type. Successful result has user specified type. Failed result has std::exception_ptr. This approach is simpler than `result_with_exception`. It does not require user to pass exception types as variadic template through all the callstack. Specific exception type can still be accessed without costly std::rethrow_exception(eptr) by using `try_catch`, if configured so via `USE_OPTIMIZED_EXCEPTION_HANDLING`. This means no information loss, but less verbosity when writing result types. Refs: #24567	2025-08-28 23:31:04 +02:00
Łukasz Paszkowski	3e740d25b5	disk_space_monitor: add subscription API for threshold-based disk space monitoring Introduce the `subscribe` method to disk_space_monitor, allowing clients to register callbacks triggered when disk utilization crosses a configurable threshold. The API supports flexible trigger options, including notifications on threshold crossing and direction (above/below). This enables more granular and efficient disk space monitoring for consumers.	2025-08-28 18:06:37 +02:00
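A rough sketch of threshold-based subscription, assuming a model in which the monitor is fed utilization samples and notifies a subscriber only when its threshold is crossed in either direction. `DiskSpaceMonitorSketch`, `subscribe` as written here, and `report` are hypothetical and do not reproduce the real `disk_space_monitor` signatures or trigger options.

```python
class DiskSpaceMonitorSketch:
    """Toy monitor: callbacks fire only when a threshold is crossed."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, threshold, callback):
        # Each subscriber remembers which side of its threshold it last saw.
        self._subscribers.append(
            {"threshold": threshold, "callback": callback, "above": False}
        )

    def report(self, utilization):
        for sub in self._subscribers:
            above = utilization >= sub["threshold"]
            if above != sub["above"]:
                sub["above"] = above
                sub["callback"](above)  # True: crossed upwards, False: downwards


events = []
monitor = DiskSpaceMonitorSketch()
monitor.subscribe(0.98, lambda above: events.append("above" if above else "below"))
for sample in (0.90, 0.97, 0.985, 0.99, 0.95):
    monitor.report(sample)
```

Only two notifications are delivered for the five samples: one when utilization rises past 98% and one when it falls back below.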
Łukasz Paszkowski	c2de678a87	docs: Add feature documentation 1. Adds user-facing page in /docs/troubleshooting/error-messages	2025-08-28 18:06:37 +02:00
Łukasz Paszkowski	535c901e50	config: Add critical_disk_utilization_level option The option defines the threshold at which the defensive mechanisms preventing nodes from running out of space, e.g. rejecting user writes, shall be activated. Its default value is 98% of the disk capacity.	2025-08-28 18:06:37 +02:00
Łukasz Paszkowski	132fd1e3f2	replica/exceptions: Add a new custom replica exception The new exception `critical_disk_utilization_exception` is thrown when the user table mutation writes are being blocked due to e.g. reaching a critical disk utilization level. This new exception is then handled on the coordinator side by transforming it into `mutation_write_failure_exception` with a meaningful error message: "Write rejected due to critical disk utilization".	2025-08-28 18:06:37 +02:00
Ferenc Szili	1b8a44af75	test: reproducer and test for drop with concurrent cleanup This change adds a reproducer and test for issue #25706	2025-08-28 16:51:36 +02:00
Ferenc Szili	a0934cf80d	truncate: check for closed storage group's gate in discard_sstables Consider the following scenario: - A tablet is migrated away from a shard - The tablet cleanup stage closes the storage group's async_gate - A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate - Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash This patch makes discard_sstables check if the storage group's gate is closed when checking for disabled compaction.	2025-08-28 16:51:25 +02:00
Petr Gusev	4b907c7711	storage_service: move get_host_id_to_ip_map to system_keyspace Reimplemented the function to use the peers cache. It could be replaced with get_ip_from_peers_table, but that would create a coroutine frame for each call.	2025-08-28 12:48:46 +02:00
Petr Gusev	de5dc4c362	system_keyspace: use peers cache in get_ip_from_peers_table The storage_service::on_change method can be called quite often by the gossiper, see scylladb/scylla-enterprise#5613. In this commit we introduce a temporal cache for system.peers so that we don't have to go to the storage each time we need to resolve host_id -> ip. We keep the cache only for a small amount of time to handle the (unlikely) scenario when the user wants to update system.peers table from CQL. Fixes scylladb/scylladb#25660	2025-08-28 12:48:39 +02:00
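The short-lived cache described above can be sketched like this. It is an illustration only: `PeersCacheSketch`, the `loader` callable, the injectable `clock`, and the one-second TTL are assumptions made for the example, not details taken from the patch.

```python
import time


class PeersCacheSketch:
    """Toy host_id -> ip cache that reloads a snapshot after a short TTL."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # stands in for reading system.peers
        self._ttl = ttl_seconds
        self._clock = clock
        self._snapshot = None
        self._loaded_at = None

    def get_ip(self, host_id):
        now = self._clock()
        if self._snapshot is None or now - self._loaded_at >= self._ttl:
            # Copy, so later changes to the source are not visible early.
            self._snapshot = dict(self._loader())
            self._loaded_at = now
        return self._snapshot.get(host_id)


loads = []
table = {"host-a": "10.0.0.1"}


def load_peers():
    loads.append(1)
    return table


fake_now = [0.0]
cache = PeersCacheSketch(load_peers, ttl_seconds=1.0, clock=lambda: fake_now[0])
first = cache.get_ip("host-a")
table["host-a"] = "10.0.0.2"     # e.g. the table is updated from CQL
second = cache.get_ip("host-a")  # still served from the cached snapshot
fake_now[0] = 2.0                # the TTL has expired
third = cache.get_ip("host-a")   # reloaded, sees the update
```

Frequent lookups within the TTL hit the snapshot instead of storage, while an out-of-band update becomes visible once the TTL expires.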
Avi Kivity	46193f5e79	Merge 'service/qos: Modularize service level controller to avoid invalid access to auth::service' from Dawid Mędrek Move management over effective service levels from `service_level_controller` to a new dedicated type -- `auth_integration`. Before these changes, it was possible for the service level controller to try to access `auth::service` after it was deinitialized. For instance, it could happen when reloading the cache. That HAS happened as described in the following issue: scylladb/scylladb#24792. Although the problem might have been mitigated or even resolved in scylladb/scylladb@10214e13bd, it's not clear how the service will be used in the future. It's better to prevent similar bugs than to try to fix them later on. The logic responsible for preventing access to an uninitialized `auth::service` was also either non-existent, complex, or insufficient. To prevent accessing `auth::service` by the service level controller, we extract the relevant portion of the code to a separate entity -- `auth_integration`. It's an internal helper type whose sole purpose is to manage effective service levels. Thanks to that, we were able to nest the lifetime of `auth_integration` within the lifetime of `auth::service`. It's now impossible to attempt to dereference it while it's uninitialized. If a bug related to an invalid access is spotted again, though, it might also be easier to debug it now. There should be no visible change to the users of the interface of the service level controller. We strived to make the patch minimal, and the only affected part of the logic should be related to how `auth::service` is accessed. The relevant portion of the initialization and deinitialization flow: (a) Before the changes: 1. Initialize `service_level_controller`. Pass a reference to an uninitialized `auth::service` to it. 2. Initialize other services. 3. Initialize and start `auth::service`. 4. (work) 5. Stop and deinitialize `auth::service`. 6. 
Deinitialize other services. 7. Deinitialize `service_level_controller`. (b) After the changes: 1. Initialize `service_level_controller`. Pass a reference to an uninitialized `auth::service` to it. (*) 2. Initialize other services. 3. Initialize and start `auth::service`. 4. Initialize `auth_integration`. Register it in `service_level_controller`. 5. (work) 6. Unregister `auth_integration` in `service_level_controller` and deinitialize it. 7. Stop and deinitialize `auth::service`. 8. Deinitialize other services. 9. Deinitialize `service_level_controller`. (*): The reference to `auth::service` in `service_level_controller` is still necessary. We need to access the service when dropping a distributed service level. Although it would be best to cut that link between the service level controller and `auth::service` too, effectively separating the entities, it would require more work, so we leave it as-is for now. It shouldn't prove problematic as far as accessing an uninitialized service goes. Trying to drop a service level at the point when we're de-initializing auth should be impossible. For more context, see the function `drop_distributed_service_level` in `service_level_controller`. A trivial test has been included in the PR. Although its value is questionable as we only try to reload the service level cache at a specific moment, it's probably the best we can deliver to provide a reproducer of the issue this patch is resolving. Fixes scylladb/scylladb#24792 Backport: The impact of the bug was minimal as it only affected the shutdown. However, since CI is failing because of it, let's backport the change to all supported versions. Closes scylladb/scylladb#25478 * github.com:scylladb/scylladb: service/qos: Move effective SL cache to auth_integration service/qos: Add auth::service to auth_integration service/qos: Reload effective SL cache conditionally service/qos: Add gate to auth_integration service/qos: Introduce auth_integration	2025-08-28 13:42:55 +03:00
Petr Gusev	91c633371e	storage_service: move get_ip_from_peers_table to system_keyspace We plan to add a cache to get_ip_from_peers_table in upcoming commits. It's more convenient to do this from system_keyspace, since the only two methods that mutate system.peers (remove_endpoint and update_peers_info) are already there.	2025-08-28 12:30:41 +02:00
Aleksandra Martyniuk	773ae73704	test: add compaction task progress test	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	f3ed852115	compaction: set progress unit for compaction tasks	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	046742bd18	compaction: find expected workload for reshard tasks Find expected workload in bytes of reshard tasks. The workload of table_resharding_compaction_task_impl is found at the beginning of its execution. Before that, expected_total_workload() returns std::nullopt, which means that the progress for this task won't be shown.	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	a2381380f2	compaction: find expected workload for global cleanup compaction tasks Sum bytes of all sstables of all non local vnode keyspaces.	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	cc9e342cb6	compaction: find expected workload for global major compaction tasks Sum bytes of all sstables of all keyspaces.	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	b12fc50de7	compaction: find expected workload for keyspace compaction tasks Add compaction_task_impl::get_keyspace_task_workload that sums the bytes in all sstables of this keyspace. This function is used to find the expected workload of the following keyspace compaction types: - major; - cleanup; - offstrategy; - upgrade_sstables; - scrub.	2025-08-28 12:10:07 +02:00
Aleksandra Martyniuk	82e241eb00	compaction: find expected workload for shard compaction tasks Add compaction_task_impl::get_shard_task_workload that sums the bytes in all sstables of this keyspace on this shard. This function is used to find the expected workload of the following shard compaction types: - major; - cleanup; - offstrategy; - upgrade_sstables; - scrub.	2025-08-28 12:04:16 +02:00
Ran Regev	515d9f3e21	docs: backup and restore feature added backup and restore as a feature to documentation Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#25608	2025-08-28 13:00:19 +03:00
Calle Wilund	2eccd17e70	system_keyspace: Limit parallelism in drop_truncation_records Fixes #25682 Refs scylla-enterprise#5580 If the truncation table is large in entries, we might create a huge parallel execution, quite possibly consuming loads of resources doing something quite trivial. Limit concurrency to a small-ish number Closes scylladb/scylladb#25678	2025-08-28 12:50:00 +03:00
Nadav Har'El	aa36430ff9	build: make patchelf executable much smaller In our recent binary distributions, we have a pretty big "patchelf" binary: -rwxr-xr-x. 1 nyh nyh 2.5M Jul 30 21:16 build/2025.3.0~rc2/libexec/patchelf Although 2.5 MB isn't what it used to be, it's still surprising that this tiny tool, that doesn't need any libraries beyond standard C++ (it doesn't use Seastar, Boost, or anything) can be this big. And 2.5 MB is over 1% of our entire "relocatable package" size, just for this silly patchelf tool :-( It turns out this was all just a mistake in our configure.py build system - patchelf was built by the exact same code that built the "scylla" executable (it is listed on the "apps" list just like Scylla), so it got linked with a bazillion libraries - and in "release" build mode, some of this was against statically linked libraries. So in this patch I move patchelf from the "apps" list to a new list of "cpp_apps" - tools that need to be built with C++ but without libraries like Seastar or abseil or boost. After this patch, the 2.5 MB patchelf is down to just 30 KB. I verified that the CMake-based build doesn't have this problem, so doesn't need fixing - it already builds patchelf with size around 30 KB. So this patch only needs to modify configure.py. Fixes #25472 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25476	2025-08-28 12:36:53 +03:00
Aleksandra Martyniuk	926753e8bf	compaction: find expected workload for table compaction tasks Add compaction_task_impl::get_table_task_workload that sums the bytes in all sstables in the table. This function is used to find the expected workload of the following compaction types: - major; - cleanup; - offstrategy; - upgrade_sstables; - scrub.	2025-08-28 10:41:22 +02:00
Dario Mirovic	587a877718	docs/cql: update documentation for default replication factor Update create-keyspace-statement section of ddl.rst since replication factor is no longer mandatory. Add an example for keyspace creation without specifying replication factor. Add an example for keyspace creation without specifying both `class` and replication factor. Refs: #16028	2025-08-28 01:42:34 +02:00
Dario Mirovic	fd84da7a50	test/cqlpy: add keyspace creation default replication factor tests Add test cases for create keyspace default replication factor. It is expected that the default replication factor is equal to the number of racks containing at least some non-zero-token nodes in the test suite. Refs: #16028	2025-08-28 01:42:34 +02:00
Dario Mirovic	ca5adf2ac1	cql3: add default replication factor to `create_keyspace_statement` When creating a new keyspace, replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };` This patch changes it in the following way - if there is no replication factor specified, use default replication factor. Default replication factor is equal to the number of racks that are not comprised of only zero-token nodes, i.e. racks that have at least one non-zero-token node. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy' };` `CREATE KEYSPACE ks WITH REPLICATION { };` Fixes: #16028	2025-08-28 01:42:29 +02:00
Radosław Cybulski	01bb7b629a	build: add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros Closes #25182	2025-08-27 21:37:54 +03:00
Aleksandra Martyniuk	bd28c50d84	compaction: return empty progress when compaction_size isn't set Currently, progress of compaction task executors is reported in bytes. However, if compaction_size isn't set for compaction task executor, the executor's progress is shown as 1/1 (if it has finished) or 0/1 (otherwise). In the following patches, the progress of executors' parent task will be found based on its children. Hence, to avoid mixing different progress units, the binary progress is no longer used. Return empty progress when compaction_size isn't set. Drop task_manager::task::impl::get_binary_progress as it's no longer used.	2025-08-27 17:51:21 +02:00
Aleksandra Martyniuk	f78dbff814	compaction: update compaction_data::compaction_size at once Currently, in compaction::setup compaction_size is updated in a loop. Due to that the total progress of compaction executors grows during their execution. Add the sstables sizes to a compaction_size variable. Update compaction_data::compaction_size after the loop.	2025-08-27 17:50:36 +02:00
Aleksandra Martyniuk	836159b0c3	tasks: do not check expected workload for done task task_manager::task::impl::get_progress checks the expected total workload of a task to find its progress. If a task has finished successfully then its workload is equal to the sum of total progresses of its children. Do not call expected_total_workload for tasks that have finished successfully.	2025-08-27 17:48:25 +02:00
Dawid Mędrek	646f8bc4cd	cdc: Set tombstone_gc when creating log table Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately, that didn't happen when creating CDC log tables, and so we might have missed some of the properties that would normally be set to some value, even if the default one. One particular example of that phenomenon was `tombstone_gc`. For better or worse, it's not a "standalone property" of a table, but rather part of `extensions`. [Somewhat related issue: scylladb/scylladb#9722] That may have and did cause trouble. Consider this scenario: 1. A CDC log table is created. 2. The table does NOT have any value of `tombstone_gc` set. 3. The user edits the table via `ALTER TABLE`. That statement treats the log table just like any other one (at least as far as the relevant portion of the logic is concerned). Among other things, it uses `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc` property is set to some value: * the default one if the user doesn't specify it in the statement, * a custom one if they do. Why is that a problem? First of all, it's confusing. When we perform a schema backup and a table uses CDC, we include an ALTER statement for its corresponding CDC log table (for more context, see issue scylladb/scylladb#18467 or commit scylladb/scylladb@f12edbdd95). There are two consequences for the user here: 1. If the log table had NOT been altered ever since it was created, the statement will miss the `tombstone_gc` property as if it couldn't be set for it at all. That's confusing! 2. If the log table HAD in fact been altered after its creation, the statement will include the `tombstone_gc` property. That's even more confusing (why was it not present the first time, but it is now?). 
The `tombstone_gc` property should always be set to avoid confusion and problematic edge cases in tests and to simply be consistent with how other schema entities work. The solution we employ is that we always set the property to the default value. That includes the case when we reattach the log table to the base; consider the following scenario: 1. Create a table with CDC enabled. 2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`. 3. Change the `tombstone_gc` property of the log table. 4. Reattach the log table to the base in the same way as in step 2. The expected result would be that the new value of `tombstone_gc` would be preserved after reattaching the log table. However, that's not what will happen. We decide to stay consistent with how other properties of a log table behave, and we reset them after every reattachment. We might change that in the future: see issue scylladb/scylladb#25523. Two reproducer tests of scylladb/scylladb#25187 are included in the changes. Fixes scylladb/scylladb#25187	2025-08-27 13:18:41 +02:00
Anna Mikhlin	03b127082d	trigger scylla-ci Jenkins job by command trigger Scylla-CI-Route job that will trigger the scylla-ci jenkins job with the relevant params by specific command: `@scylladbbot trigger-ci` Fixes: https://scylladb.atlassian.net/browse/PKG-2 Closes scylladb/scylladb#25695	2025-08-27 14:12:28 +03:00
Dawid Mędrek	2229060992	tombstone_gc: Add overload of get_default_tombstone_gc_mode We add a new overload of the function to avoid accessing the information about a keyspace via `data_dictionary::database` or `replica::database`. We motivate the change by the fact that there are situations when that piece of information might not be available: for instance, Alternator tables reside in separate keyspaces created specifically for them. When we create one, the mutations corresponding to creating the keyspace and the table must be applied together to ensure atomicity. Because of that, during the creation of the table, we will not be able to learn anything about the keyspace as it doesn't exist yet. That scenario is the actual motivation for this commit, and it is a prerequisite for upcoming changes in creation of CDC log tables. For more context on that problem, see issue: scylladb/scylladb#25187.	2025-08-27 13:00:10 +02:00
Dawid Mędrek	fd4e577db0	tombstone_gc: Rename get_default_tombstonesonte_gc_mode The previous identifier was probably a typo that was missed.	2025-08-27 13:00:10 +02:00
Nadav Har'El	401b04f9ea	vector-store: disambiguate call to format() As explained in commit `3e84d43f93` two years ago, using just format() instead of seastar::format() or fmt::format() is frowned upon, because it can cause ambiguities - resulting in compile errors - in some cases. The compile errors seem to crop up randomly depending on the exact version of fmt used, so the build can work in CI using one specific version, but fail on a developer's machine using a different version. In this patch I fix one such ambiguity that breaks compilation on my development machine's fmt-11.0.2 and clang 19.1.7, but works fine on the slightly newer frozen toolchain. The error I get before this fix is: service/vector_store_client.cc:261:39: error: call to 'format' is ambiguous 261 | throw configuration_exception(format("Invalid Vector Store service URI: {}", uri)); | ^~~~~~ Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25691	2025-08-27 13:54:33 +03:00
Nadav Har'El	0a990d2a48	config: split tri_mode_restriction to a separate header Today, any source file or header file that wants to use the tri_mode_restriction type needs to include db/config.hh, which is a large and frequently-changing header file. In this patch we split this type into a separate header file, db/tri_mode_restriction.hh, and avoid a few unnecessary inclusions of db/config.hh. However, a few source files now need to explicitly include db/config.hh, after its transitive inclusion is gone. Note that the overwhelmingly common inclusion of db/config.hh continues to be a problem after this patch - 128 source files include it directly. So this patch is just the first step in long journey. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25692	2025-08-27 13:47:04 +03:00
Michał Jadwiszczak	39db90a535	test/cluster: add view build status tests	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	f7ebc7b054	test/cluster: add view building coordinator tests	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	90b5b2c5f5	utils/error_injection: allow to abort `injection_handler::wait_for_message()`	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	cf138da853	test: adjust existing tests - Disable tablets in `test_migration_on_existing_raft_topology`. Because views on tablets are experimental now, we can safely assume that view building coordinator will start with view build status on raft. - Add error injection to pause view building on worker. Used to pause the view building process; there is an analogous error injection in view_builder. - Do a read barrier in `test_view_in_system_tables` Increases test stability by making sure that the node sees up-to-date group0 state and `system.built_views` is synced. - Wait for the view to be built in some tests Increases test stability by making sure that the view is built. - Remove xfail marker from `test_tablet_streaming_with_unbuilt_view` This series fixes https://github.com/scylladb/scylladb/issues/21564 and this test should work now.	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	6056b55309	utils/error_injection: add injection with `sleep_abortable()`	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	1e2fa069df	db/view/view_builder: ignore `no_such_keyspace` exception	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	c4288aa1f8	docs/dev: add view building coordinator documentation	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	454033a773	db/view/view_building_worker: work on `process_staging` tasks	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	233f4dcee3	db/view/view_building_worker: register staging sstable to view building coordinator when needed Change return type of `check_needs_view_update_path()`. Instead of returning a bool which tells whether to use the staging directory (and register to `view_update_generator`) or the normal directory, the function now returns an enum with possible values: - `normal_directory` - use normal directory for the sstable - `staging_directly_to_generator` - use staging directory and register to `view_update_generator` - `staging_managed_by_vbc` - use staging directory but don't register it to `view_update_generator` but create view building tasks for later The third option is new, it's used when the table has any view which is currently in the building process. In this case, registering it to `view_update_generator` prematurely may lead to base-view inconsistency (for example when a replica is in a pending state).	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	f61039cbfd	db/view/view_building_worker: discover staging sstables When starting view_building_worker, go through all staging sstables for tablet-tables and register them locally. If there is no associated view building tasks for any sstable, create the task.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	651827cdab	db/view/view_building_worker: add method to register staging sstable The method will be used when a new staging sstable needs to go through the view building coordinator (the coordinator will decide when to process this staging sstable). Callers push new staging sstables to a queue and notify the async fiber to create `view_building_task`s from the sstables and commit them to group0.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	9de35ed2a2	db/view/view_update_generator: add method to process staging sstables instantly When the view building coordinator is sending `process_staging` task, we want to skip view_update_generator's staging sstables loop and process them instantly.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	2516993c70	db/view/view_update_generator: extract generating updates from staging sstables to a method	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	46507f76a6	db/view/view_update_generator: ignore tablet-based sstables Staging sstables for tablet-based tables are now handled by view_building_worker, so they need to be ignored by the generator.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	201c4fafec	db/view/view_building_coordinator: update view build status on node join/left Copy view build status for new node for tablet views and remove relevant statuses when a node is leaving the cluster.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	b855786804	db/view/view_building_coordinator: handle tablet operations If the view building coordinator is running, adjust view_building_tasks in case of tablet operations. The mutations are generated in the same batch as tablet mutations. At the start of tablet migration/resize/RF change, started view building tasks are aborted (by setting ABORTED state) if needed. Then, new adjusted tasks are created in the group0 batch which ends the tablet operation, and aborted tasks are removed from the table. In case the tablet operation fails or is revoked, aborted view building tasks are rolled back by creating new copies of them, and the aborted ones are deleted from the table. View building tasks are not aborted/changed during tablet repair, because in this case, even if a vb task is started, a staging sstable will be generated.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	56df5acd77	db/view: add view building task mutation builder	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	9312cd83c5	service/topology_coordinator: run view building coordinator Run view building coordinator alongside topology coordinator once the feature is available.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	08c9e6b9bb	db/view: introduce `view_building_coordinator` The coordinator is responsible for building tablet-based views. It schedules tasks for `view_building_worker` and updates views' statuses. The tasks are scheduled in a way that one shard is processing only one tablet at most (there may be multiple tasks since a base table may have multiple views). Support for tablet operations will be added in next commits.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	2b3e1682d7	db/view/view_building_worker: update built views locally Because `system.built_views` is a node-local table, we cannot mark a view as built directly from the view building coordinator. Instead, view building worker looks at data from `system.view_build_status_v2` and updates `built_views` table accordingly.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	c9e710dca3	db/view: introduce `view_building_worker` The worker is responsible for building tablet-based views by executing tasks scheduled by the view building coordinator. It observes the view building state machine and waits on the machine's condition variable (so the worker is woken up when group0 state is applied). The tasks are executed in batches, all tasks in one batch need to have the same: type, base_id, table_id. One shard can only execute one batch at a time (at least for now, in the future we might want to change that). The worker keeps track of finished and failed tasks in its local state. The state is cleared when `view_building_state::currently_processed_base_table` is changed.	2025-08-27 10:22:59 +02:00
Emil Maskovsky	cfc87746b6	storage: pass host_id as parameter to `maybe_reconnect_to_preferred_ip()` Previously, `maybe_reconnect_to_preferred_ip()` retrieved the host ID using `gossiper::get_host_id()`. Since the host ID is already available in the calling function, we now pass it directly as a parameter. This change simplifies the code and eliminates a potential race condition where `gossiper::get_host_id()` could fail, as described in scylladb/scylla#25621. Refs: scylladb/scylla#25621 Backport: Recommended for 2025.x release branches to avoid potential issues from unnecessary calls to `gossiper::get_host_id()` in subscribers. Closes scylladb/scylladb#25662	2025-08-27 10:35:46 +03:00
Michał Jadwiszczak	a59624c604	db/view: extract common view building functionalities Extract common methods of view builder consumer to an abstract class and `flush_base()` and `make_partition_slice()` functions, so they can be used in view builder (vnode-based views) and view building consumer (tablet-based views; introduced in the next commit).	2025-08-27 08:55:48 +02:00
Michał Jadwiszczak	f71594738e	db/view: prepare to create abstract `view_consumer` In next commit, I'm going to introduce `view_building_worker::consumer`, with very similar functionalities to `view_builder::consumer` but it'll only consume range of one tablet per execution. Since most functions are very similar, I'll create abstract `view_consumer` which will be base for both of the consumers. In order to make the transition more readable, this commit prepares the `view_builder::consumer` by making some functions virtual and next commit will extract part of functions to the abstract class.	2025-08-27 08:55:48 +02:00
Michał Jadwiszczak	e901b6fde4	message/messaging_service: add `work_on_view_building_tasks` RPC The RPC will be used by view building coordinator to attach to and wait for tasks performed by view building worker (introduced in later commit). The RPC gets a vector of tasks' ids and returns a vector of `view_task_result`s. The i-th task result refers to the i-th task id.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	7c73792194	service/topology_coordinator: make `term_changed_error` public View building coordinator may also throw `term_changed_error`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	6e3e287a39	db/schema_tables: create/cleanup tasks when an index is created/dropped Similarly as in previous commits, create view building tasks when an index is created and cleanup view building status when it's dropped.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	76caaea3f1	service/migration_manager: cleanup view building state on drop keyspace When a keyspace is dropped, remove all unfinished building tasks for all views and remove their entries from `system.view_build_status_v2` and `system.built_views`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f10c5c4493	service/migration_manager: cleanup view building state on drop view When a view is dropped, remove all unfinished building tasks, remove entries from `system.view_build_status_v2` and `system.built_views`. If the view is currently being built, removing its tasks means they are also aborted. Finished tasks are already removed from the table.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	6d1fbf06ed	service/migration_manager: create view building tasks on create view Create view building tasks in the same batch as new view mutations. The tasks are created only if `VIEW_BUILDING_COORDINATOR` feature is on and the view is in tablet keyspace.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	19651b4978	test/boost: enable proxy remote in some tests After a few next patches, creating/dropping a view in tablet keyspace will require a remote proxy to obtain references to system keyspace and view building state. Because of this, remote proxy needs to be explicitly enabled in boost tests which create views.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	204f61ffe1	service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` The reference is needed to get `view_building_state_machine`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	76a6dd82fd	service/migration_manager: coroutinize `prepare_new_view_announcement()`	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	d2e1b6d44a	service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` Those references are needed to manage view building tasks while a view is created/dropped.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f2e7051a84	service: reload `view_building_state_machine` on group0 apply() The state may be also reloaded on `topology_change` or `mixed_change` because topology coordinator may change view building tasks during tablet operations.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	d5d81591db	service/vb_coordinator: add currently processing base The view building coordinator will be building all views of one base table at a time. Select first available base table as currently processing base and save this information to `system.scylla_local`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	e0184377ca	db/system_keyspace: move `get_scylla_local_mutation()` up So it can be used by all helper functions, keeping the logical order from the header file.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	46a24d960d	db/system_keyspace: add `view_building_tasks` table The table is managed by group0 and uses schema commitlog. The commit also includes helper functions.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	44890a52a8	db/view: add view_building_state and views_state `view_building_state` holds mapping of `view_building_task`s for tablet-based views. The structure is a memory representation of data stored in group0 tables. `views_state` holds information about tablet-based views and their build status.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	ce1890e512	db/system_keyspace: add method to get view build status map	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f90dd522df	db/view: extract `system.view_build_status_v2` cql statements to system_keyspace Until now, all changes to `system.view_build_status_v2` were made from view.cc and the file contained all of the helper methods. This commit introduces a `build_status` enum class to avoid using hardcoded strings and extracts the helper methods to `system_keyspace` class, so they can be later used by the view building coordinator.	2025-08-27 08:55:46 +02:00
Michał Jadwiszczak	18609adfce	db/system_keyspace: move `internal_system_query_state()` function earlier So it can be used in all system keyspace proxy methods while maintaining the same order as in the header file.	2025-08-27 08:55:46 +02:00
Michał Jadwiszczak	d0826e7cb1	db/view: ignore tablet-based views in `view_builder` View building of tablet-based views will be handled by the view building coordinator later in this patch.	2025-08-27 08:55:46 +02:00
Michał Jadwiszczak	7dba3667c9	gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-27 08:55:46 +02:00
Nadav Har'El	e2c99436cf	Merge 'cdc, vector_search: enable CDC when the index is created' from Dawid Pawlik When a vector index is created in Scylla, it is initially built using a full scan of the database. After that, it stays up to date by tracking changes through CDC, which should be automatically enabled when the vector index is created. When a user attempts to enable Vector Search (VS), the system checks whether Change Data Capture (CDC) is enabled and properly configured: 1. CDC is not enabled - CDC is automatically enabled with the minimum required TTL (Time-to-Live) for VS (24 hours) and the delta mode set to 'full' or post-image is enabled. - If the user later tries to reduce the CDC TTL below 24 hours or set delta mode to 'keys' with post-image disabled, the action fails. - Error message: Clearly states that CDC TTL must be at least 24 hours and delta mode must be set to 'full' or post-image must be enabled for VS to function. 2. CDC is already enabled - If CDC TTL is ≥ 24 hours and delta mode is set to 'full' or post-image is enabled: VS is enabled successfully. - If CDC TTL is < 24 hours or delta mode is set to 'keys' with post-image disabled: The VS enabling process fails. - Error message: Informs the user that CDC TTL must be at least 24 hours, delta mode must be set to 'full' or post-image must be enabled, and provides a link to documentation on how to update the TTL, delta mode, and post-image. When a user attempts to disable CDC when VS is enabled, the action will fail and the user will be informed by error message that clearly states that VS needs to be disabled (vector indexes have to be dropped) first. Full setup requirements and steps will be detailed in the documentation of Vector Search. 
Co-authored-by: @smoczy123 Fixes: VECTOR-27 Fixes: VECTOR-25 Closes scylladb/scylladb#25179 * github.com:scylladb/scylladb: test/cqlpy: ensure Vector Search CDC options test/boost: adjust CDC boost tests for Vector Search test/cql: add Vector Search CDC enable/disable test cdc, vector_index: provide minimal option setup for Vector Search test/cqlpy: adjust describe table tests with CDC for Vector Search describe, cdc: adjust describe for cdc log tables cdc: enable CDC log when vector index is created test/cqlpy: run vector_index tests only on vnodes vector_index: check if vector index exists in schema	2025-08-26 23:01:32 +03:00
Dawid Mędrek	fc1c41536c	service/qos: Move effective SL cache to auth_integration Since `auth_integration` manages effective service levels, let's move the relevant cache from `service_level_controller` to it.	2025-08-26 18:41:48 +02:00
Dawid Mędrek	dd5a35dc67	service/qos: Add auth::service to auth_integration The new service, `auth_integration`, has taken over the responsibility over managing effective service levels from `service_level_controller`. However, before these changes, it still accessed `auth::service` via the service level controller. Let's change that. Note that we also remove a check that `auth::service` has been initialized. It's not necessary anymore because the lifetime of `auth_integration` is strictly nested within the lifetime of `auth::service`. In actuality, `service_level_controller` should lose its reference to `auth::service` completely. All of the management over effective service levels has already been moved to `auth_integration`. However, the reference is still needed when dropping a distributed service level because we need to update the corresponding attribute for relevant roles. That should not lead to invalid accesses, though. Dropping a service level should not be possible when `auth::service` is not initialized.	2025-08-26 18:41:43 +02:00
Dawid Mędrek	e929279d74	service/qos: Reload effective SL cache conditionally Since `service_level_controller` outlives `auth_integration`, it may happen that we try to access it when it has already been deinitialized. To prevent that, we only try to reload or clear the effective service level cache when the object is still alive. These changes solve an existing problem with an invalid memory access. For more context, see issue scylladb/scylladb#24792. We provide a reproducer test that consistently fails before these changes but passes after them. Fixes scylladb/scylladb#24792	2025-08-26 18:41:40 +02:00
Dawid Mędrek	34afb6cdd9	service/qos: Add gate to auth_integration We add a named gate to `auth_integration` that will aid us in synchronizing ongoing tasks with stopping the service.	2025-08-26 18:41:37 +02:00
Dawid Mędrek	7d0086b093	service/qos: Introduce auth_integration We introduce a new type, `auth_integration`, that will be used internally by `service_level_controller`. Its purpose is to take over the responsibility for managing effective service levels. The main problem of the current implementation of service level controller is its dependency on `auth::service` whose lifetime is strictly nested within the lifetime of service level controller. That may lead, and already has led, to invalid memory accesses; for an example, see issue scylladb/scylladb#24792. Our strategy is to split service level controller into smaller parts and ensure that we access `auth::service` only when it's valid to do so. This commit is the first step towards that. We don't change anything in the logic yet, just add the new type. Further adjustments will be made in following commits.	2025-08-26 18:41:34 +02:00
Pavel Emelyanov	67b63768e4	api: Capture and use db in cache_service handlers Now the sharded<database>& argument is there, so it can replace ctx one on handlers lambdas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:50:11 +03:00
Pavel Emelyanov	596d4640ff	api: Add sharded<database>& arg to set_cache_service() The reference is already available in set_server_column_family(), pass it further so that "cache" handlers are able to use it (next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:49:35 +03:00
Pavel Emelyanov	4e556214ba	api: Squash (un)set_cache_service into ..._column_family The set_server_column_family() registers API handlers that work with replica::database. The set_server_cache() does the very same thing, but registers handlers with some other prefix. Squash the latter into former, later "cache" handlers will also make use of the database reference argument that's already available in ..._column_family() setter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:46:48 +03:00
Pavel Emelyanov	1b4b539706	api: Coroutinize set_server_column_family() To facilitate next patching Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:46:33 +03:00
Dario Mirovic	8b0a551177	test/cqlpy: add unknown compression algorithm test case Add `test_unknown_compression_algorithm` test case to `test_protocol_exceptions.py` test suite. This change improves test coverage for zero throws protocol exception handling. Refs: #24567	2025-08-25 13:31:40 +02:00
Avi Kivity	6dc2c42f8b	alternator: streams: refactor std::views::transform with side effect std::views::transform()s should not have side effects since they could be called several times, depending on the algorithm they're paired with. For example, std::ranges::to() can run the algorithm once to measure the resulting container size, and then a second time to copy the data (avoiding reallocations). If that happens, then the side-effect happens twice. Avoid this by refactoring the code. Make the side-effect -- appending to the `column` vector -- happen first, then use that result to generate the `regular_column` vector. In this case, the side effect did not happen twice because small_vector's std::from_range_t constructor only reserves if the input range is sized (and it is not), but better not have the weakness in the code. Closes scylladb/scylladb#25011	2025-08-25 09:40:05 +03:00
Michael Litvak	25fb3b49fa	dist/docker: add dc and rack arguments add --dc and --rack commandline arguments to the scylla docker image, to allow starting a node with a specified dc and rack names in a simple way. This is useful mostly for small examples and demonstrations of starting multiple nodes with different racks, when we prefer not to bother with editing configuration files. The ability to assign nodes to different racks is especially important with RF=Rack enforcing. The previous method to achieve this is to set the snitch to GossipingPropertyFileSnitch and provide a configuration file in /etc/scylla/cassandra-rackdc.properties with the name of the dc and rack. The new dc and rack parameters are implemented similarly by using the snitch GossipingPropertyFileSnitch and writing the dc and rack values to the rackdc properties file. We don't support passing the parameters together with a different snitch, or when mounting a properties file from the host, because we don't want to overwrite it. Example: docker run -d --name scylla1 scylladb/scylla --dc my_dc1 --rack my_rack1 Fixes scylladb/scylladb#23423 Closes scylladb/scylladb#25607	2025-08-24 17:48:07 +03:00
Nadav Har'El	87dd96f9a2	Merge ' Alternator: DynamoDB compatible WCU Calculation via Read-Before-Write Support' from Amnon Heiman This series adds support for a DynamoDB-compatible Write Capacity Unit (WCU) calculation in Alternator by introducing an optional forced read-before-write mechanism. Alternator's model differs from DynamoDB, and as a result, some write operations may report lower WCU usage compared to what DynamoDB would report. While this is acceptable in many cases, there are scenarios where users may require accurate WCU reporting that aligns more closely with DynamoDB's behavior. To address this, a new configuration option, alternator_force_read_before_write, is introduced. When enabled, Alternator will perform a read before executing PutItem, UpdateItem, and DeleteItem operations. This allows it to take the existing item size into account when computing the WCU. BatchWriteItem support is also extended to use this mechanism. Because BatchWriteItem does not support returning old items directly, several internal changes were made to support reading previous item sizes with minimal overhead. Reads are performed at consistency level LOCAL_ONE for efficiency, and the WCU calculation is now done in multiple stages to accurately account for item size differences. In addition to the implementation changes, test coverage was added to validate the new behavior. These tests confirm that WCU is calculated based on the larger of the old and new items when read-before-write is active, including for BatchWriteItem. This feature comes with performance overhead and is therefore disabled by default. It can be enabled at runtime via the system.config table and should be used only when precise WCU tracking is necessary. 
New feature, no need to backport Closes scylladb/scylladb#24436 * github.com:scylladb/scylladb: alternator/test_returnconsumedcapacity.py: Test forced read before write alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write executor.cc: get_previous_item with consistency level executor: Extend API of put_or_delete_item alternator/executor.cc: Accurate WCU for put, update, delete config: add alternator_force_read_before_write	2025-08-24 11:38:24 +03:00
Avi Kivity	8815491085	treewide: include boost headers as "system" headers Boost is external to the project so treat its headers as "system" headers and include them with angle brackets. Closes scylladb/scylladb#25619	2025-08-22 17:21:24 +03:00
Piotr Dulikowski	5709d94826	Merge 'cql3: Warn when creating RF-rack-invalid keyspace' from Dawid Mędrek Although RF-rack-valid keyspaces are not universally enforced yet (they're governed by the configuration option `rf_rack_valid_keyspaces`), we'd like to encourage the user to abide by the restriction. To that end, we're introducing a warning when creating or altering a keyspace. If the configuration option is disabled, but the user is trying to create an RF-rack-invalid keyspace, they'll receive a warning. If the option is turned off, we will also log all of the RF-rack-invalid keyspaces at start-up. We provide validation tests. Fixes scylladb/scylladb#23330 Backport: we'd like to encourage the user to abide by the restriction even when they don't enforce it to make it easier in the future to adjust the schema when there's no way to disable it anymore. Because of that, we'd like to backport it to all relevant versions, starting with 2025.1. Closes scylladb/scylladb#24785 * github.com:scylladb/scylladb: main: Log RF-rack-invalid keyspaces at startup cql3/statements: Fix indentation cql3: Warn when creating RF-rack-invalid keyspace	2025-08-22 11:33:32 +02:00
Evgeniy Naydanov	ab15c94a09	test.py: dtest/commitlog_test: add test_pinned_cl_segment_doesnt_resurrect_data test_pinned_cl_segment_doesnt_resurrect_data was not moved in #24946 from scylla-dtest to this repo, because it's marked as xfail (#14879), but actually the issue is fixed and there is no reason to keep the test in scylla-dtest. Also remove unused imports. Closes scylladb/scylladb#25592	2025-08-22 11:30:10 +03:00
Raphael S. Carvalho	149f9d8448	replica: Fix race between drop table and merge completion handling Consider this: 1) merge finishes, wakes up fiber to merge compaction groups 2) drop table happens, which in turn invokes truncate underneath 3) merge fiber stops old groups 4) truncate disables compaction on all groups, but the ones stopped 5) truncate performs a check that compaction has been disabled on all groups, including the ones stopped 6) the check fails because groups being stopped didn't have compaction explicitly disabled on them To fix it, the check on step 6 will ignore groups that have been stopped, since those are not eligible for having compaction explicitly disabled on them. The compaction check is there, so ongoing compaction will not propagate data being truncated, but here it happens in the context of drop table which doesn't leave anything behind. Also, a group stopped is somewhat equivalent to compaction disabled on it, since the procedure to stop a group stops all ongoing compaction and eventually removes its state from compaction manager. Fixes #25551. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#25563	2025-08-22 10:19:43 +03:00
kendrick-ren	d6e62aeb6a	Update launch-on-gcp.rst Add the missing '=' mark in --zone option. Otherwise the command complains. Closes scylladb/scylladb#25471	2025-08-22 10:13:52 +03:00
Botond Dénes	3dcb596201	Merge 'test: properly unset recovery_leader in the recovery procedure tests' from Patryk Jędrzejczak After changing the type of the `recovery_leader` config option from `sstring` to `UUID` in #25032, setting `recovery_leader` to an empty string became an incorrect way to unset it. The following error started to appear in the recovery procedure tests: ``` init - marshaling error: UUID string size mismatch: '' : recovery_leader ``` We unset `recovery_leader` properly in this PR. To do it, we introduce a simple way to remove config options in tests. Backport is unneeded. This error was harmless, and Scylla ignored `recovery_leader` after logging the error as expected by the tests. Closes scylladb/scylladb#25365 * github.com:scylladb/scylladb: test: properly unset recovery_leader in the recovery procedure tests test: manager_client: allow removing a config option test: manager_client: add docstring to server_update_config	2025-08-22 10:09:37 +03:00
Benny Halevy	45c496c276	api: storage_service: fix token_range documentation Note that the token_range type is used only by describe_ring. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25609	2025-08-22 10:06:21 +03:00
Patryk Jędrzejczak	193a74576a	test/cluster/conftest: cluster_con: provide default values for port and use_ssl Some cluster tests use `cluster_con` when they need a different load balancing policy or auth provider. However, no test uses a port other than 9042 or enables SSL, but all tests must pass `9042, False` because these parameters don't have default values. This makes the code more verbose. Also, it's quite obvious that 9042 stands for port, but it's not obvious what `False` is related to, so there is a need to check the definition of `cluster_con` while reading any test that uses it. No reason to backport, it's only a minor refactoring. Closes scylladb/scylladb#25516	2025-08-22 09:51:24 +03:00
David Garcia	07d798a59d	docs: fix sidebar on local preview Closes scylladb/scylladb#25560	2025-08-22 09:50:07 +03:00
David Garcia	c3c70ba73f	docs: expose alternator metrics Renders in the docs some metrics introduced in https://github.com/scylladb/scylladb/pull/24046/files that were not being displayed in https://docs.scylladb.com/manual/stable/reference/metrics.html Closes scylladb/scylladb#25561	2025-08-22 09:49:52 +03:00
David Garcia	461a0bad8a	docs: do not show any version warning for upgrade guide pages Closes scylladb/scylladb#25562	2025-08-22 09:49:27 +03:00
Avi Kivity	e140bd6355	Update seastar submodule * seastar 1520326e6...0a90f7945 (13): > include: Keep linux-aio.hh deeper in internal > include: Move array_map.hh to util/internal/ > io_tester: Add support for scheduling supergroups > Merge 'tls: Force buffer splitting into gnutls record block sized chunks' from Calle Wilund > Add iotune --get-best-iops-with-buffer-sizes option > noncopyable_function: use memcpy instead of bytewise copy loop > test: Make fair_queue_test validation code use BOOST_CHECK_...-s > test: Rework test_fair_queue_random_run > Merge 'Remove capacity configuration for fair_queue tests' from Pavel Emelyanov > reactor: replace boost::barrier with std::barrier<> > rpc: server::process(): reindent > test: Remove no-op dispatch from fair_queue ticker > Merge 'json: API level 8: use noncopyable_function in json_return_type' from Benny (#2921) Closes scylladb/scylladb#25624	2025-08-22 09:41:02 +03:00
Andrzej Jackowski	86fc513bd9	auth: allow dropping roles in saslauthd_authenticator Before this change, `saslauthd_authenticator` prevented dropping roles. The current documentation instructs users to `Ensure Scylla has the same users and roles as listed in the LDAP directory`. Therefore, ScyllaDB should allow dropping roles so administrators can remove obsolete roles from both LDAP and ScyllaDB. The code change is minimal — dropping a role is a no-op, similar to the existing no-op implementations for successful `create` and `alter` operations. `saslauthd_authenticator_test` is updated to verify that dropping a role doesn't throw anymore. Fixes: scylladb/scylladb#25571 Closes scylladb/scylladb#25574	2025-08-22 09:40:44 +03:00
Yaron Kaikov	c0fd3deeab	github: Enhance label sync to support P0/P1 priority labels Extend the existing label synchronization system to handle P0 and P1 priority labels in addition to backport/* labels: - Add P0/P1 label syncing between issues and PRs bidirectionally - Automatically add 'force_on_cloud' label to PRs when P0/P1 labels are present (either copied from issues or added directly) The workflow now triggers on P0 and P1 label events in addition to backport/* labels, ensuring priority labels are properly reflected across the entire PR lifecycle. Refs: https://github.com/scylladb/scylla-pkg/issues/5383 Closes scylladb/scylladb#25604	2025-08-22 06:50:13 +03:00
Dawid Mędrek	837d267cbf	main: Log RF-rack-invalid keyspaces at startup When the configuration option `rf_rack_valid_keyspaces` is enabled and there is an RF-rack-invalid keyspace, starting a node fails. However, when the configuration option is disabled, but there still is a keyspace that violates the condition, we'd like Scylla to print a warning informing the user about the fact. That's what happens in this commit. We provide a validation test.	2025-08-21 19:35:33 +02:00
Dawid Mędrek	af8a3dd17b	cql3/statements: Fix indentation	2025-08-21 19:29:36 +02:00
Dawid Mędrek	60ea22d887	cql3: Warn when creating RF-rack-invalid keyspace Although RF-rack-valid keyspaces are not universally enforced yet (they're governed by the configuration option `rf_rack_valid_keyspaces`), we'd like to encourage the user to abide by the restriction. To that end, we're introducing a warning when creating or altering a keyspace. If the configuration option is disabled, but the user is trying to create an RF-rack-invalid keyspace, they'll receive a warning. We provide a validation test.	2025-08-21 19:29:33 +02:00
Evgeniy Naydanov	3a98331731	test.py: don't fail if use multiple tests from one dir in commandline There is the stash item REPEATED_FILES for directory items, which is used to cut recursion. But if multiple tests from one directory are added to the ./test.py command line, this solution prevents handling the non-first tests properly because it was already collected for the first one. Change the behavior to store in the stash only the files that are in the process of repetition, rather than all repeated files. Rename the stash item to REPEATING_FILES to reflect this change. Closes scylladb/scylladb#25611	2025-08-21 19:43:13 +03:00
Pawel Pery	509f5ddb89	vector_store_client: set keepalive for the http client's socket Checking for dead peers or network is helpful in maintaining a lifetime of the http client. This patch sets TCP_KEEPALIVE option on the http client's socket.	2025-08-21 16:22:30 +02:00
Pawel Pery	4b459c6855	vector_store_client: disable Nagle's algorithm on the http client Nagle’s algorithm and Delayed ACK’s algorithm are enabled by default on sockets in Linux. As a result we can experience 40ms latency on simply waiting for ACK on the client side. Disabling the Nagle’s algorithm (using TCP_NODELAY) should fix the issue (client won’t wait 40ms for ACKs). This patch sets `TCP_NODELAY` on every socket created by the `http_client`.	2025-08-21 16:20:26 +02:00
Dawid Pawlik	01e7a48030	vector_store_client: fix HTTP error message formatting Content of the HTTP error was logged in Scylla as literal list of chars (default temporary buffer formatting). Changed to print the sstring made out of temporary buffer, which fixes the problem with formatting, making the output clear and readable for humans. Fixes: VECTOR-141 Closes scylladb/scylladb#25329	2025-08-21 14:33:41 +02:00
Botond Dénes	09dc285b4a	Merge 'Remove redis from scylla source tree' from Ran Regev - remove redis documentation First, remove the redis documentation. - remove ./redis and dependencies Second, remove the redis directory and its dependencies from the project. Fixes: #25144 This is a cleanup, no need to backport. Closes scylladb/scylladb#25148 * github.com:scylladb/scylladb: remove ./redis and dependencies remove redis documentation	2025-08-21 14:26:11 +03:00
Benny Halevy	d6ca393928	test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints Following up on `6129411a5e`, improve test_vnode_keyspace_describe_ring by verifying that the endpoints listed by describe_ring match those returned by the `natural_endpoints` api (for random tokens). The latter are calculated using an independent code path directly from the effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-21 11:48:17 +03:00
Benny Halevy	e34980ac87	test/pylib/rest_client: add natural_endpoints function Invoke the `/storage_service/natural_endpoints/{keyspace}` api Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-21 11:48:17 +03:00
Pavel Emelyanov	47750496d2	Merge 'test.py: metrics: add host_id suffix to .db file' from Evgeniy Naydanov CI can run several test.py sessions on different machines (builders) for one build, and to avoid being overwritten, the .db file with metrics needs to have a unique name: add host_id as we already do for the .xml report in `run_pytest()`. Also add host_id columns to metric tables in case we later aggregate .db files. Add host_id suffix to `toxiproxy_server.log` for the same reason. Fixes: https://github.com/scylladb/scylladb/issues/25462 Closes scylladb/scylladb#25542 * github.com:scylladb/scylladb: test.py: add host_id suffix to toxiproxy_server.log test.py: metrics: add host_id suffix to .db file	2025-08-21 11:34:47 +03:00
Robert Bindar	3291a5cc75	Fix dbuild boost::gregorian usage error On my dbuild runs, compiler complained about no member "gregorian" in namespace boost in the user_function_test.cc file. Was also noticed in CI. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#25593	2025-08-21 11:32:47 +03:00
Petr Gusev	f261b4594d	ip_address_updater: call raft_topology_update_ip even if ip hasn't changed Previously, the prev_ip check caused problems for bootstrapping nodes. Suppose a bootstrapping node A appears in the system.peers table of some other node B. Its record has only ID and IP of the node A, due to the special handling of bootstrapping nodes in raft_topology_update_ip. Suppose node B gets temporarily isolated from the topology coordinator. The topology coordinator fences out node B and successfully finishes bootstrapping of the node A. Later, when the connectivity is restored, topology_state_load runs on the node B, node A is already in normal state, but the gossiper on B might not have any state for it yet. In this case, raft_topology_update_ip would not update system.peers because the gossiper state is missing. Subsequently, on_join/on_restart/on_alive events would skip updates because the IP in gossiper matches the IP for that node in system.peers. Removing the check avoids this issue, with negligible overhead: * on_join/on_restart/on_alive happen only once in a node’s lifetime * topology_state_load already updates all nodes each time it runs. This problem was found by a fencing test, which crashed a node while another node was going through the bootstrapping process. After restart, the node saw that the other node was already in normal state, since the topology coordinator fenced out this node and managed to finish the bootstrapping process successfully. This test will be provided in a separate fencing-for-paxos PR. Closes scylladb/scylladb#25596	2025-08-21 10:02:06 +02:00
Ernest Zaslavsky	4bee0491ba	cmake: Add missing `incremental.cc` to `repair/CMakeLists.txt` Add `incremental.cc` to `repair/CMakeLists.txt` to fix CMake based build Closes scylladb/scylladb#25601	2025-08-21 09:40:36 +03:00
Asias He	b12404ba52	streaming: Enclose potential throws in try block and ensure sink close before logging - Move the initialization of log_done inside the try block to catch any exceptions it may throw. - Relocate the failure warning log after sink.close() cleanup to guarantee sink.close() is always called before logging errors. Refs #25497 Closes scylladb/scylladb#25591	2025-08-20 19:46:56 +02:00
Dawid Pawlik	9463ac10e2	test/cqlpy: ensure Vector Search CDC options Add a test to check that the minimal options for Vector Search are ensured and that creating CDC in violation of the minimal requirements is disallowed.	2025-08-20 17:20:38 +02:00
Dawid Pawlik	61c7b935e1	test/boost: adjust CDC boost tests for Vector Search Adjust name conflict and permissions tests when enabling CDC for Vector Search. Add a test that checks if CDC with a vector column is set up properly.	2025-08-20 17:20:37 +02:00
Dawid Pawlik	45a4714ab8	test/cql: add Vector Search CDC enable/disable test Add CQL test for the automatic enablement of CDC log when creating an index on vector column using 'vector_index' custom class. Check if the logging is disabled after index is dropped.	2025-08-20 17:20:37 +02:00
Dawid Pawlik	a27eef9f18	cdc, vector_index: provide minimal option setup for Vector Search Ensure that the CDC used by Vector Search has at least 24h TTL and delta mode is set to 'full' or postimage is enabled. This setup is required by the Vector Store to work as intended. The TTL of at least 24h is a rough estimate of the maximal time needed for the full scan conducted by Vector Store to finish. The delta mode set to 'full' or postimage enabled is needed to read the values of vectors being written to the table, so Vector Store can save them in the desired external index. As the default we set TTL = 24h, delta = 'full', postimage = false. Full delta is the preferred option for logging the vector values, as it is less costly and does not require an additional read on write.	2025-08-20 17:20:20 +02:00
Ran Regev	ebf1db5c5e	remove ./redis and dependencies Remove ./redis and all its usages. This is the second commit that removes ./redis from Scylla Signed-off-by: Ran Regev <ran.regev@scylladb.com>	2025-08-20 17:53:23 +03:00
Ran Regev	6eca083137	remove redis documentation As part of removing redis from Scylla source tree. This commit removes all related documentation. The following commit removes the code itself. Signed-off-by: Ran Regev <ran.regev@scylladb.com>	2025-08-20 17:53:23 +03:00
Benny Halevy	6129411a5e	locator: utils: get_all_ranges, construct_range_to_endpoint_map: use end-bound ranges Commit `60d2cc886a` changed get_all_ranges to return start-bound ranges and pre-calculate the wrapping range, and then construct_range_to_endpoint_map to pass r.start() (that is now always engaged) as the vnode token. However, as can be seen in token_metadata_impl::first_token the token ranges (a.k.a. vnodes) end with the sorted tokens, not start with them, so an arbitrary token t belongs to a vnode in some range `sorted_tokens[i-1] < t <= sorted_tokens[i]` Fixes #25541 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25580	2025-08-20 15:15:40 +02:00
Dawid Pawlik	38d568c7c0	test/cqlpy: adjust describe table tests with CDC for Vector Search Run tests describing CDC tables both with standard and vector index created CDC log enablement. Adjust the test message of CDC log describe statement. Mark `test_desc_restore` as failing due to the #25187 bug.	2025-08-20 12:38:52 +02:00
Dawid Pawlik	35b82e6d2f	describe, cdc: adjust describe for cdc log tables Make CDC log table describe mention that it can be created by creating the vector index on base table's vector column.	2025-08-20 12:38:52 +02:00
Dawid Pawlik	af2a544395	cdc: enable CDC log when vector index is created Enable CDC log table when creating an index on vector column using 'vector_index' custom index class.	2025-08-20 12:38:52 +02:00
Dawid Pawlik	461d820fbb	test/cqlpy: run vector_index tests only on vnodes When creating an index on vector column using 'vector_index' class the CDC log is being created as it is required for Vector Search. Due to the fact that CDC does not yet work with tablets (Refs #16317) enabled we have to mark the tests failing on tablets and run them on vnodes to make sure the vector index tests continue to pass.	2025-08-20 12:38:52 +02:00
Avi Kivity	eefb6a0642	Merge 'storage_proxy: node_local_only: always use my_host_id' from Petr Gusev The previous implementation did not handle topology changes well: * In `node_local_only` mode with CL=1, if the current node is pending, the CL is increased to 2, causing `unavailable_exception`. * If the current tablet is in `write_both_read_old` and we try to read with `node_local_only` on the new node, the replica list will be empty. This patch changes `node_local_only` mode to always use `my_host_id` as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise `on_internal_error` is called. backport: not needed, since `node_local_only` is only used in LWT for tablets and it hasn't been released yet. Closes scylladb/scylladb#25508 * github.com:scylladb/scylladb: test_tablets_lwt: add test_lwt_during_migration storage_proxy: node_local_only: always use my_host_id	2025-08-20 12:11:44 +03:00
Avi Kivity	34f661e5aa	Merge 'Make api/column_family endpoints capture and use sharded<database>' from Pavel Emelyanov The http_context object carries sharded<database> reference and all handlers in the api/ code can use it they way they want. This creates potential use-after-free, because the context is initialized very early and is destroyed very late. All other services are used by handlers differently -- after a service is initialized, the relevant endpoints are registered and the service reference is captured on handlers. Since endpoint deregistration is defer-scheduled at the same place, this guarantees that handlers cannot use the service after it's stopped. This PR does the same for api/ handlers -- the sharded<database> reference is captured inside set_server_column_family() and then used by handlers lambdas. Similar changes for other services: #21053, #19417, #15831, etc It's a part of the on-going cleanup of service dependencies, no need to backport Closes scylladb/scylladb#25467 * github.com:scylladb/scylladb: api/column_family: Capture sharded<database> to call get_cf_stats() api: Patch get_cf_stats to get sharded<database>& argument api: Drop CF map-reducers ability to work with http context api: Patch callers of map_reduce_cf(_raw)? 
to use sharded<database> api: Use captured sharded<database> reference in handlers api/column_family: Make map_reduce_cf_time_histogram() use sharded<database> api/column_famliy: Make sum_sstable() use sharded<database> api/column_family: Make get_cf_unleveled_sstables() use sharded<database> api/column_famliy: Make get_cf_stats_count() use sharded<database> api/column_family: Make get_cf_rate_and_histogram() use sharded<database> api/column_family: Make get_cf_histogram() use sharded<database> api/column_family: Make get_cf_stats_sum() use sharded<database> api/column_family: Make set_tables_tombstone_gc() use sharded<database> api/column_family: Make set_tables_autocompaction() use sharded<database> api/column_family: Make for_tables_on_all_shards() use sharded<database> api: Capture sharded<database> for set_server_column_family() api: Make CF map-reducers work on sharded<database> directly api: Make map_reduce_cf_time_histogram() file-local api: Remove unused ctx argument from run_toppartitions_query()	2025-08-20 12:09:39 +03:00
Dawid Pawlik	27ceb85508	vector_index: check if vector index exists in schema Add `has_vector_index` function to check if an index on vector column using 'vector_index' custom index class exists in the schema. Co-authored-by: Michał Hudobski <michal.hudobski@scylladb.com>	2025-08-20 10:35:55 +02:00
Avi Kivity	352cda4467	treewide: avoid including gms/feature_service.hh from headers To avoid dependency proliferation, switch to forward declarations. In one case, we introduce indirection via std::unique_ptr and deinline the constructor and destructor. Ref #1 Closes scylladb/scylladb#25584	2025-08-20 10:30:27 +03:00
Botond Dénes	d20304fdf8	Merge 'test.py: dtest: port next_gating tests from commitlog_test.py' from Evgeniy Naydanov Copy `commitlog_test.py` from scylla-dtest test suite and make it works with `test.py` As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `skip`, 'skip_if', and `xfail` markers. test.py uses `commitlog` directory instead of dtest's `commitlogs`. Also, add `commitlog_segment_size_in_mb: 32` option to test_stop_failure_policy to make _provoke_commitlog_failure work. Tests `test_total_space_limit_of_commitlog_with_large_limit` and `test_total_space_limit_of_commitlog_with_medium_limit` use too much disk space and have too big execution time. Keep them in scylla-dtest for now. Enable the test in `suite.yaml` (run in dev mode only.) Additional modifications to test.py/dtest shim code: - add ScyllaCluster.flush() method - add ScyllaNode.stress() method - add tools/files.py::corrupt_file() function - add tools/data.py::run_query_with_data_processing() function - copy some assertions from dtest Also add missed mode restriction for auth_test.py file. Closes scylladb/scylladb#24946 * github.com:scylladb/scylladb: test.py: dtest: remove slow and greedy tests from commitlog_test.py test.py: dtest: make commitlog_test.py run using test.py test.py: dtest: add ScyllaCluster.flush() method test.py: dtest: add ScyllaNode.stress() method test.py: dtest: add tools/data.py::run_query_with_data_processing() function test.py: dtest: add tools/files.py::corrupt_file() function test.py: dtest: copy some assertions from dtest test.py: dtest: copy unmodified commitlog_test.py	2025-08-19 17:25:07 +03:00
Michał Chojnowski	c1b513048c	sstables/types.hh: fix fmt::formatter<sstables::deletion_time> Obvious typo. Fixes scylladb/scylladb#25556 Closes scylladb/scylladb#25557	2025-08-19 17:21:18 +03:00
Petr Gusev	894c8081e6	test_tablets_lwt: add test_lwt_during_migration	2025-08-19 16:11:56 +02:00
Petr Gusev	ed6bec2cac	storage_proxy: node_local_only: always use my_host_id The previous implementation did not handle topology changes well: * In node_local_only mode with CL=1, if the current node is pending, the CL is raised to 2, causing unavailable_exception. * If the current tablet is in write_both_read_old and we read with node_local_only on the new node, the replica list is empty. This patch changes node_local_only mode to always use my_host_id as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise on_internal_error is called.	2025-08-19 16:11:49 +02:00
Evgeniy Naydanov	47e4d470af	test.py: add host_id suffix to toxiproxy_server.log	2025-08-19 11:33:47 +00:00
Evgeniy Naydanov	8ea49092b7	test.py: metrics: add host_id suffix to .db file CI can run several test.py sessions on different machines (builders) for one build, so to avoid being overwritten, the .db file with metrics needs a unique name: add host_id as we already do for the .xml report in run_pytest(). Also add host_id columns to the metric tables in case we ever aggregate .db files.	2025-08-19 11:33:11 +00:00
Botond Dénes	66db95c048	Merge 'Preserve PyKMIP logs from failed KMIP tests' from Nikos Dragazis This PR extends the `tmpdir` class with an option to preserve the directory if the destructor is called during stack unwinding. It also uses this feature in KMIP tests, where the tmpdir contains PyKMIP server logs, which may be useful when diagnosing test failures. Fixes #25339. Not so important to be backported. Closes scylladb/scylladb#25367 * github.com:scylladb/scylladb: encryption_at_rest_test: Preserve tmpdir from failing KMIP tests test/lib: Add option to preserve tmpdir on exception	2025-08-19 13:17:29 +03:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired, which is a huge simplification. This patch uses the repaired_at from the sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and the sstables_repaired_at column in the system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0. The being_repaired in-memory field of a SSTable is used to explicitly mark that a SSTable is being selected.
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. 
The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
Dawid Pawlik	50eeb11c84	.gitignore: add rust target When using automatic Rust build tools in an IDE, the files generated in the `rust/target/` directory were treated by git as unstaged changes. After this change, the generated files no longer pollute the git changes interface. Closes scylladb/scylladb#25389	2025-08-19 13:09:18 +03:00
Dawid Mędrek	6a71461e53	treewide: Fix spelling errors The errors were spotted by our GitHub Actions. Closes scylladb/scylladb#24822	2025-08-19 13:07:43 +03:00
libo2_yewu	fa84e20b7a	scripts/coverage.py: correct the coverage report path The `path/name` directory does not exist and needs to be created first. Signed-off-by: libo-sober <libo_sober@163.com> Closes scylladb/scylladb#25480	2025-08-19 13:01:49 +03:00
Sayanta Banerjee	eae1869d3a	Update docs/features/cdc/cdc-streams.rst Co-authored-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2025-08-19 15:06:30 +05:30
Avi Kivity	41475858aa	storage_proxy: endpoint_filter(): fix rack count confusion endpoint_filter() is used by batchlog to select nodes to replicate to. It contains an unordered_multimap data structure that maps rack names to nodes. It misuses std::unordered_map::bucket_count() to count the number of racks. While values that share a key in a multimap will definitely be in the same bucket, it's possible for values that don't share a key to share a bucket. Therefore bucket_count() undercounts the number of racks. Fix this by using a more accurate data structure: a map of a set. The patch changes validated.bucket_count() to validated.size() and validated.size() to a new variable nr_validated. The patch does cause an extra two allocations per rack (one for the unordered_map node, one for the unordered_set bucket vector), but this is only used for logged batches, so it is amortized over all the mutations in the logged batch. Closes scylladb/scylladb#25493	2025-08-19 11:58:39 +03:00
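The "map of a set" approach described in the commit above can be sketched as follows. This is a minimal illustration, not the actual endpoint_filter() code: the `rack_name` and `node_name` aliases and both helper functions are hypothetical, and racks and nodes are modelled as plain strings. The point is that the outer map holds exactly one entry per distinct rack, so `size()` gives the rack count, whereas `bucket_count()` only reports how many hash buckets the container currently has.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-ins: racks and nodes are plain strings in this sketch.
using rack_name = std::string;
using node_name = std::string;
using rack_map = std::unordered_map<rack_name, std::unordered_set<node_name>>;

// Groups nodes by rack. The outer map has one entry per distinct rack,
// so by_rack.size() is the number of racks.
rack_map group_by_rack(const std::vector<std::pair<rack_name, node_name>>& nodes) {
    rack_map by_rack;
    for (const auto& entry : nodes) {
        by_rack[entry.first].insert(entry.second);
    }
    return by_rack;
}

// Total number of nodes across all racks. With a map of sets this has to be
// tracked or computed separately (assumed to be the role of nr_validated).
std::size_t count_nodes(const rack_map& by_rack) {
    std::size_t total = 0;
    for (const auto& entry : by_rack) {
        total += entry.second.size();
    }
    return total;
}
```

With this layout, the rack count and the node count are two separate quantities, which matches the commit's split between `validated.size()` and the new `nr_validated` variable.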
Dawid Mędrek	2227eb48bb	test/cqlpy/test_cdc.py: Add validation test for re-attached log tables When the user disables CDC on a table, the CDC log table is not removed. Instead, it's detached from the base table, and it functions as a normal table (with some differences). If that log table lives up to the point when the user re-enables CDC on the base table, instead of creating a new log table, the old one is re-attached to the base. For more context on that, see commit: scylladb/scylladb@adda43edc7. In this commit, we add validation tests that check whether the changes on the base table after disabling CDC are reflected on the log table after re-enabling CDC. The definition of the log table should be the same as if CDC had never been disabled. Closes scylladb/scylladb#25071	2025-08-19 10:15:41 +02:00
Botond Dénes	f8b79d563a	Merge 's3: Minor refactoring and beautification of S3 client and tests' from Ernest Zaslavsky This pull request introduces minor code refactoring and aesthetic improvements to the S3 client and its associated test suite. The changes focus on enhancing readability, consistency, and maintainability without altering any functional behavior. No backport is required, as the modifications are purely cosmetic and do not impact functionality or compatibility. Closes scylladb/scylladb#25490 * github.com:scylladb/scylladb: s3_client: relocate `req` creation closer to usage s3_client: reformat long logging lines for readability s3_test: extract file writing code to a function	2025-08-18 18:48:42 +03:00
Aleksandra Martyniuk	a10e241228	replica: lower severity of failure log Flush failure with seastar::named_gate_closed_exception is expected if a respective compaction group was already stopped. Lower the severity of a log in dirty_memory_manager::flush_one for this exception. Fixes: https://github.com/scylladb/scylladb/issues/25037. Closes scylladb/scylladb#25355	2025-08-18 13:30:42 +03:00
Avi Kivity	96956e48c4	Merge 'utils: stall_free: detect clear_gently method of const payload types' from Benny Halevy Currently, when a container or smart pointer holds a const payload type, utils::clear_gently does not detect the object's clear_gently method as the method is non-const and requires a mutable object, as in the following example in class tablet_metadata: ``` using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>; using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>; ``` That said, when a container is cleared gently the elements it holds are destroyed anyhow, so we'd like to allow clearing them gently before destruction. This change still doesn't allow directly calling utils::clear_gently on const objects. Also adds the respective unit tests. Fixes #24605 Fixes #25026 * This is an optimization that is not strictly required to backport (as https://github.com/scylladb/scylladb/pull/24618 dealt with clear_gently of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>` well enough) Closes scylladb/scylladb#24606 * github.com:scylladb/scylladb: utils: stall_free: detect clear_gently method of const payload types utils: stall_free: clear gently a foreign shared ptr only when use_count==1	2025-08-18 12:52:02 +03:00
Evgeniy Naydanov	ab1a093d94	test.py: dtest: remove slow and greedy tests from commitlog_test.py Tests test_total_space_limit_of_commitlog_with_large_limit and test_total_space_limit_of_commitlog_with_medium_limit use too much disk space and take too long to run. Keep them in scylla-dtest for now.	2025-08-18 09:42:13 +00:00
Evgeniy Naydanov	647043d957	test.py: dtest: make commitlog_test.py run using test.py As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `skip`, `skip_if`, and `xfail` markers. test.py uses the `commitlog` directory instead of dtest's `commitlogs`. Remove the test_stop_failure_policy test because the way it provokes a commitlog failure (changing file permissions) doesn't work on CI. Enable the test in suite.yaml (run in dev mode only)	2025-08-18 09:42:13 +00:00
Evgeniy Naydanov	5f6e083124	test.py: dtest: add ScyllaCluster.flush() method	2025-08-18 09:42:13 +00:00
Evgeniy Naydanov	c378dc3fab	test.py: dtest: add ScyllaNode.stress() method	2025-08-18 09:42:13 +00:00
Evgeniy Naydanov	6f42019900	test.py: dtest: add tools/data.py::run_query_with_data_processing() function	2025-08-18 09:42:13 +00:00
Evgeniy Naydanov	2c4f2de3b0	test.py: dtest: add tools/files.py::corrupt_file() function	2025-08-18 09:42:13 +00:00
Evgeniy Naydanov	80b797e376	test.py: dtest: copy some assertions from dtest Copy assertions required for commitlog_test.py: - assert_almost_equal - assert_row_count - assert_row_count_in_select_less - assert_lists_equal_ignoring_order	2025-08-18 09:42:13 +00:00
Evgeniy Naydanov	1a2d132456	test.py: dtest: copy unmodified commitlog_test.py	2025-08-18 09:42:13 +00:00
Pavel Emelyanov	4f55af9578	Merge 'test.py: pytest: support --mode/--repeat in a common way for all tests' from Evgeniy Naydanov Implement repetition of files using the `pytest_collect_file` hook: run file collection as many times as needed to cover all `--mode`/`--repeat` combinations. Store the build mode and run ID in the stash of the repeated item. Some additional changes: - Add `TestSuiteConfig` class to handle all operations with `test_config.yaml` - Add support for `run_first` option in `test_config.yaml` - Move disabled test logic to `pytest_collect_file` hook. These changes make it possible to remove custom logic for `--mode`, `--repeat`, and disabled tests in the code for C++ tests and prepare for switching Python/CQLApproval/Topology tests to the pytest runner. Also, this PR includes required refactoring changes and fixes: - Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: `base.py`, `boost.py`, and `unit.py` - Remove unused imports in `test.py` - Use the constant for `"suite.yaml"` string - Some test suites have their own test runners based on pytest, and they don't need all the stuff we use for `test.py`. Move all code related to the `test.py` framework to `test/pylib/runner.py` and use it as a plugin conditionally (by using the `SCYLLA_TEST_RUNNER` env variable.) - Add `cwd` parameter to `run_process()` methods in `resource_gather` module to avoid using `os.chdir()` (and sort parameters in the same order as in `subprocess.Popen`.) - `extra_scylla_cmdline_options` is a list of command-line arguments, and each argument should be a separate item. A few configuration files have the `--reactor-backend` option added in a format which doesn't follow this rule.
This PR is a refactoring step for https://github.com/scylladb/scylladb/pull/25443 Closes scylladb/scylladb#25465 * github.com:scylladb/scylladb: test.py: pytest: support --mode/--repeat in a common way for all tests test.py: pytest: streamline suite configuration handling test.py: refactor: remove unused imports in test.py test.py: fix run with bare pytest after merge of scylladb/scylladb#24573 test.py: refactor: move framework-related code to test.pylib.runner test.py: resource_gather: add cwd parameter to run_process() test.py: refactor: use proper format for extra_scylla_cmdline_options	2025-08-18 12:24:04 +03:00
Pavel Emelyanov	2510a7b488	api/column_family: Capture sharded<database> to call get_cf_stats() Update more handlers not to get the database from context, but to capture it directly in the handlers' lambdas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 12:04:34 +03:00
Pavel Emelyanov	dc31b68451	api: Patch get_cf_stats to get sharded<database>& argument Currently it accepts the http context and immediately gets the database from it to pass to map_reduce_cf. Callers are updated to pass the database from the context they already have. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 12:03:27 +03:00
Avi Kivity	e9928b31b8	Merge 'sstables/trie: add BTI key translation routines' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25396 Next part: implementing sstable index writers and readers on top of the abstract trie writers/readers. The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability. This series provides translation routines for ring positions and clustering positions from Scylla's native in-memory structures to BTI's byte-comparable encoding. This translation is performed whenever a new decorated key or clustering block is added to a BTI index, and whenever a BTI index is queried for a range of positions. For a description of the encoding, see `fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)` The translation logic, with all the fragment awareness, lazy evaluation and avoidable copies, is fairly bloated for the common cases of simple and small keys. This is a potential optimization target for later. No backports needed, new functionality. Closes scylladb/scylladb#25506 * github.com:scylladb/scylladb: sstables/trie: add BTI key translation routines tests/lib: extract generate_all_strings to test/lib tests/lib: extract nondeterministic_choice_stack to test/lib sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file sstables/mx: move clustering_info from writer.cc to types.hh sstables/trie: allow `comparable_bytes_iterator` to return a mutable span dht/ring_position: add ring_position_view::weight()	2025-08-18 11:55:26 +03:00
Pavel Emelyanov	7933a68921	api: Drop CF map-reducers ability to work with http context This patch finalizes the change started by the previous patch of the similar title -- the map_reduce_cf(_raw) is switched to work only with sharded<replica::database> reference. All callers were updated by previous patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:45:24 +03:00
Pavel Emelyanov	ffd25f0c16	api: Patch callers of map_reduce_cf(_raw)? to use sharded<database> There are some of them left that still pass http_context. These handlers will eventually get their captured sharded database reference, but for now make them explicitly use one from context. This will allow to de-templatize map_reduce_cf... helpers making the code simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:43:55 +03:00
Pavel Emelyanov	7e0726d55b	api: Use captured sharded<database> reference in handlers Not all of them can switch from ctx to database, so in few places both, the database and ctx, are captured. However, the ctx.db reference is no longer used by the column_family handlers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	720a8fef4b	api/column_family: Make map_reduce_cf_time_histogram() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	49cb81fb56	api/column_family: Make sum_sstable() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	3595ea7f49	api/column_family: Make get_cf_unleveled_sstables() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	d32ac35f60	api/column_family: Make get_cf_stats_count() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	cde39d3fc7	api/column_family: Make get_cf_rate_and_histogram() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	edc9e302e3	api/column_family: Make get_cf_histogram() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	422debbee2	api/column_family: Make get_cf_stats_sum() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	1c1fabc578	api/column_family: Make set_tables_tombstone_gc() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	0158743f5e	api/column_family: Make set_tables_autocompaction() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	f52eb7cae2	api/column_family: Make for_tables_on_all_shards() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	818a41ccdb	api: Capture sharded<database> for set_server_column_family() Similarly to other API handlers, instead of using a database from http context, patch the setting methods to capture the database from main code and pass it around to handlers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	d3d217b3c9	api: Make CF map-reducers work on sharded<database> directly Next patches are going to change a bunch of map_reduce_cf_... callers to pass sharded<database> reference to it, not the http context. Not to patch all the api/ code at once, keep the ability to call it with ctx at hand. Eventually only the sharded<database>& overload will be kept. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:53 +03:00
Pavel Emelyanov	bdb7c2b014	api: Make map_reduce_cf_time_histogram() file-local It's not used outside of api/column_family.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:53 +03:00
Pavel Emelyanov	b0db83575c	api: Remove unused ctx argument from run_toppartitions_query() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:53 +03:00
Asias He	082bc70a0a	replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair It helps to hide the compaction_group_views from repair subsystem.	2025-08-18 11:01:22 +08:00
Asias He	be15972006	compaction: Move compaction_reenabler to compaction_reenabler.hh So it can be used without bringing the whole compaction/compaction_manager.hh.	2025-08-18 11:01:22 +08:00
Asias He	cac4940129	topology_coordinator: Make rpc::remote_verb_error to warning level This can happen when the peer node is shutting down. It is not something we cannot recover from. The log level should be warning instead of error, because our dtest treats error-level logs as a test failure. This was observed in the test_repair_one_node_alter_rf dtest.	2025-08-18 11:01:22 +08:00
Asias He	76316f44a7	repair: Add metrics for sstable bytes read and skipped from sstables scylla_repair_inc_sst_skipped_bytes: Total number of bytes skipped from sstables for incremental repair on this shard. scylla_repair_inc_sst_read_bytes : Total number of bytes read from sstables for incremental repair on this shard.	2025-08-18 11:01:22 +08:00
Asias He	b0364fcba3	test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair Disable incremental repair so that the second repair can still work on the repaired data set.	2025-08-18 11:01:22 +08:00
Asias He	ad5275fd4c	test.py: Add tests for tablet incremental repair The following tests are added for tablet incremental repair: - Basic incremental repair - Basic incremental repair with error - Minor compaction and incremental repair - Major compaction and incremental repair - Scrub compaction and incremental repair - Cleanup/Upgrade compaction and incremental repair - Tablet split and incremental repair - Tablet merge and incremental repair	2025-08-18 11:01:21 +08:00
Asias He	0d7e518a26	repair: Add tablet incremental repair support The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired, which is a huge simplification. This patch uses the repaired_at from the sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and the sstables_repaired_at column in the system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0. The being_repaired in-memory field of a SSTable is used to explicitly mark that a SSTable is being selected.
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472	2025-08-18 11:01:21 +08:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch adds incremental_repair support in compaction. - The sstables are split into repaired and unrepaired sets. - The repaired and unrepaired sets are compacted separately. - The repaired_at from the sstable and sstables_repaired_at from the system.tablets table are used to decide if a sstable is repaired or not. - Different compaction tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
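The repaired/unrepaired split described in the commit above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual compaction code: `sstable_info`, `is_repaired()`, `split_result` and `split_by_repair_state()` are hypothetical names, the 64-bit field is assumed to be signed, and the rule "non-zero and `repaired_at <= sstables_repaired_at`" is inferred from the commit messages (0 is the default meaning "not repaired").

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of an sstable: only the field relevant here.
struct sstable_info {
    std::int64_t repaired_at = 0;  // 0 is the default and means "not repaired"
};

// Assumed rule: an sstable counts as repaired when it carries a non-zero
// repaired_at that does not exceed the tablet's sstables_repaired_at.
bool is_repaired(const sstable_info& sst, std::int64_t sstables_repaired_at) {
    return sst.repaired_at != 0 && sst.repaired_at <= sstables_repaired_at;
}

// The two sets that would then be compacted separately.
struct split_result {
    std::vector<sstable_info> repaired;
    std::vector<sstable_info> unrepaired;
};

// Splits the given sstables into the repaired and unrepaired sets.
split_result split_by_repair_state(const std::vector<sstable_info>& sstables,
                                   std::int64_t sstables_repaired_at) {
    split_result result;
    for (const auto& sst : sstables) {
        if (is_repaired(sst, sstables_repaired_at)) {
            result.repaired.push_back(sst);
        } else {
            result.unrepaired.push_back(sst);
        }
    }
    return result;
}
```

Because the virtual clock only increases, an sstable whose `repaired_at` is larger than the tablet's `sstables_repaired_at` stays in the unrepaired set until the tablet's value catches up.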
Evgeniy Naydanov	e44b26b809	test.py: pytest: support --mode/--repeat in a common way for all tests Implement repetition of files using pytest_collect_file hook: run file collection as many times as needed to cover all --mode/--repeat combinations. Also move disabled test logic to this hook. Store build mode and run_id in pytest item stashes. Simplify support of C++ tests: remove redundant facade abstraction and put all code into 3 files: base.py, boost.py, and unit.py Add support for `run_first` option in test_config.yaml	2025-08-17 15:26:23 +00:00
Evgeniy Naydanov	bffb6f3d01	test.py: pytest: streamline suite configuration handling Move test_config.yaml handling code from common_cpp_conftest.py to TestSuiteConfig class in test/pylib/runner.py	2025-08-17 12:32:36 +00:00
Evgeniy Naydanov	a2a59b18a3	test.py: refactor: remove unused imports in test.py Also use the constant for "suite.yaml" string.	2025-08-17 12:32:36 +00:00
Evgeniy Naydanov	a188523448	test.py: fix run with bare pytest after merge of scylladb/scylladb#24573 To run tests with bare pytest command we need to have almost the same set of options as test.py because we reuse code from test.py. scylladb/scylladb#24573 added `--pytest-arg` option to test.py but not to test/conftest.py which breaks running Python tests using bare pytest command.	2025-08-17 12:32:35 +00:00
Evgeniy Naydanov	600d05471b	test.py: refactor: move framework-related code to test.pylib.runner Some test suites have their own test runners based on pytest, and they don't need all the stuff we use for test.py. Move all code related to the test.py framework to test/pylib/runner.py and use it as a plugin conditionally (by using TEST_RUNNER variable.)	2025-08-17 12:32:35 +00:00
Evgeniy Naydanov	f2619d2bb0	test.py: resource_gather: add cwd parameter to run_process() Also sort the arguments in the Popen call to match its signature.	2025-08-17 12:32:35 +00:00
Evgeniy Naydanov	cb4d9b8a09	test.py: refactor: use proper format for extra_scylla_cmdline_options `extra_scylla_cmdline_options` is a list of command-line arguments, and each argument should be a separate item. A few configuration files have the `--reactor-backend` option added in a format which doesn't follow this rule.	2025-08-17 12:32:35 +00:00
Michał Chojnowski	413dcf8891	sstables/trie: add BTI key translation routines This file provides translation routines for ring positions and clustering positions from Scylla's native in-memory structures to BTI's byte-comparable encoding. This translation is performed whenever a new decorated key or clustering block is added to a BTI index, and whenever a BTI index is queried for a range of positions. For a description of the encoding, see `fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)` The translation logic, with all the fragment awareness, lazy evaluation and avoidable copies, is fairly bloated for the common cases of simple and small keys. This is a potential optimization target for later.	2025-08-15 11:13:00 +02:00
Pavel Emelyanov	f689d41747	Merge 'db/hints: Improve logs' from Dawid Mędrek Before these changes, the logs in hinted handoff often didn't provide crucial information like the identifier of the node that hints were being sent to. Also, some of the logs were misleading and referred to other places in the code than the one where an exception or some other situation really occurred. We modify those logs, extending them by more valuable information and fixing existing issues. What's more, all of the logs in `hint_endpoint_manager` and `hint_sender` follow a consistent format now: ``` <class_name>[<destination host ID>]:<function_name>: <message> ``` This way, we should always have AT LEAST the basic information. Fixes scylladb/scylladb#25466 Backport: There is no risk in backporting these changes. They only have impact on the logs. On the other hand, they might prove helpful when debugging an issue in hinted handoff. Closes scylladb/scylladb#25470 * github.com:scylladb/scylladb: db/hints: Add new logs db/hints: Adjust log levels db/hints: Improve logs	2025-08-15 09:34:29 +03:00
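The consistent log format mentioned in the commit above, `<class_name>[<destination host ID>]:<function_name>: <message>`, can be illustrated with a small helper. This is only a sketch of the format: `format_hint_log()` is a hypothetical function, not part of `hint_endpoint_manager` or `hint_sender`, and it uses plain strings rather than the project's logging facilities.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper that builds a line in the format
// <class_name>[<destination host ID>]:<function_name>: <message>
std::string format_hint_log(std::string_view class_name,
                            std::string_view host_id,
                            std::string_view function_name,
                            std::string_view message) {
    std::string out;
    // Reserve room for the four parts plus the fixed punctuation "[", "]:", ": ".
    out.reserve(class_name.size() + host_id.size() +
                function_name.size() + message.size() + 5);
    out.append(class_name);
    out += '[';
    out.append(host_id);
    out += "]:";
    out.append(function_name);
    out += ": ";
    out.append(message);
    return out;
}
```

Putting the destination host ID directly after the class name means every line can be filtered per target node, which is the "basic information" the commit says should always be present.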
Patryk Jędrzejczak	03cc34e3a0	test: test_maintenance_socket: use cluster_con for driver sessions The test creates all driver sessions by itself. As a consequence, all sessions use the default request timeout of 10s. This can be too low for the debug mode, as observed in scylladb/scylla-enterprise#5601. In this commit, we change the test to use `cluster_con`, so that the sessions have the request timeout set to 200s from now on. Fixes scylladb/scylla-enterprise#5601 This commit changes only the test and is a CI stability improvement, so it should be backported all the way to 2024.2. 2024.1 doesn't have this test. Closes scylladb/scylladb#25510	2025-08-15 09:32:20 +03:00
Pavel Emelyanov	05d8d94257	Merge 'test.py: Add -k=EXPRESSION pytest argument support for boost tests.' from Artsiom Mishuta follow-up PR after fast fix https://github.com/scylladb/scylladb/pull/25394 should be merged only after - https://github.com/scylladb/scylla-pkg/pull/5414 Since boost tests run via pure pytest, we can finally run tests using the -k=EXPRESSION pytest argument. This expression is applied to the "test function", so it is possible to run a subset of test functions that match patterns across all boost tests. The --skip and -k arguments are mutually exclusive, because -k extends the --skip functionality. examples: ``` ./build/release/test/boost/auth_passwords_test --list_content passwords_are_salted* correct_passwords_authenticate* incorrect_passwords_do_not_authenticate* ./test.py --mode=dev -k="correct" -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::incorrect_passwords_do_not_authenticate.dev.1 PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1 ./test.py --mode=dev -k="not incorrect and not passwords_are_salted" -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1 ./test.py --mode=dev --skip=incorrect --skip=passwords_are_salted -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1 ./test.py --mode=dev -k="correct and not incorrect" -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1 ``` Closes scylladb/scylladb#25400 * github.com:scylladb/scylladb: test.py: add -k=EXPRESSION pytest argument support for boost tests. test.py: small refactoring of how boost test arguments make	2025-08-15 09:24:56 +03:00
Jenkins Promoter	d4ce070168	Update pgo profiles - aarch64	2025-08-15 05:03:28 +03:00
Jenkins Promoter	c0f691f4d9	Update pgo profiles - x86_64	2025-08-15 04:56:11 +03:00
Michał Chojnowski	5e76708335	tests/lib: extract generate_all_strings to test/lib This util will be used in another test file in a later commit, so hoist it to `test/lib`.	2025-08-14 22:38:38 +02:00
Taras Veretilnyk	30ff5942c6	database_test: fix race in test_drop_quarantined_sstables The test_drop_quarantined_sstables test could fail due to a race between compaction and quarantining of SSTables. If compaction selects an SSTable before it is moved to quarantine, and change_state is called during compaction, the SSTable may already be removed, resulting in a std::filesystem_error due to missing files. This patch resolves the issue by wrapping the quarantine operation inside run_with_compaction_disabled(). This ensures compaction is paused on the compaction group view while SSTables are being quarantined, preventing the race. Additionally, it updates the test to quarantine up to 1/5 of the SSTables instead of one chosen randomly, and increases the number of SSTables generated to improve the test scenario. Fixes scylladb/scylladb#25487 Closes scylladb/scylladb#25494	2025-08-14 20:23:42 +03:00
Taras Veretilnyk	367eaf46c5	keys: from_nodetool_style_string don't split single partition keys Users with single-column partition keys that contain colon characters were unable to use certain REST APIs and 'nodetool' commands, because the API split key by colon regardless of the partition key schema. Affected commands: - 'nodetool getendpoints' - 'nodetool getsstables' Affected endpoints: - '/column_family/sstables/by_key' - '/storage_service/natural_endpoints' Refs: #16596 - This does not fully fix the issue, as users with compound keys will face the issue if any column of the partition key contains a colon character. Closes scylladb/scylladb#24829	2025-08-14 19:52:04 +03:00
Avi Kivity	1ef6697949	Merge 'service/vector_store_client: Add live configuration update support' from Karol Nowacki Enable runtime updates of vector_store_uri configuration without requiring server restart. This allows to dynamically enable, disable, or switch the vector search service endpoint on the fly. To improve the clarity the seastar::experimental::http::client is now wrapped in a private http_client class that also holds the host, address, and port information. Tests have been added to verify that the client correctly handles transitions between enabled/disabled states and successfully switches traffic to a new endpoint after a configuration update. Closes: VECTOR-102 No backport is needed as this is a new feature. Closes scylladb/scylladb#25208 * github.com:scylladb/scylladb: service/vector_store_client: Add live configuration update support test/boost/vector_store_client_test.cc: Refactor vector store client test service/vector_store_client: Refactor host_port struct created service/vector_store_client: Refactor HTTP request creation	2025-08-14 19:45:06 +03:00
Avi Kivity	fe6e1071d3	Merge 'locator: util: optimize describe_ring' from Benny Halevy This change includes basic optimizations to locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop. This yields an improvement of 20% in cpu time. With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took Before: 30 milliseconds (2.6 microseconds per range) After: 24 milliseconds (2.1 microseconds per range) Add respective unit test for vnode keyspace and for tablets. Fixes #24887 * backport up to 2025.1 as describe_ring slowness was hit in the field with large clusters Closes scylladb/scylladb#24889 * github.com:scylladb/scylladb: locator: util: optimize describe_ring locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map locator: effective_replication_map: get_natural_replicas: get is_vnode param test: cluster: test_repair: add test_vnode_keyspace_describe_ring	2025-08-14 19:39:17 +03:00
Ernest Zaslavsky	a0016bd0cc	s3_client: relocate `req` creation closer to usage Move the creation of the `req` object to the point where it is actually used, improving code clarity and reducing premature initialization.	2025-08-14 16:18:43 +03:00
Ernest Zaslavsky	6ef2b0b510	s3_client: reformat long logging lines for readability Break up excessively long logging statements to improve readability and maintain consistent formatting across the codebase.	2025-08-14 16:18:43 +03:00
Ernest Zaslavsky	29960b83b5	s3_test: extract file writing code to a function Reduce code doing the same over and over again by extracting file writing code to a function	2025-08-14 16:18:43 +03:00
Artsiom Mishuta	fcd511a531	test.py: add -k=EXPRESSION pytest argument support for boost tests. Since boost tests run via pure pytest, we can finally run tests using the -k=EXPRESSION pytest argument. This expression is applied to the "test function", so it is possible to run a subset of test functions that match patterns across all boost tests. The --skip and -k arguments are mutually exclusive, because -k extends the --skip functionality. examples: ./build/release/test/boost/auth_passwords_test --list_content passwords_are_salted* correct_passwords_authenticate* incorrect_passwords_do_not_authenticate* ./test.py --mode=dev -k="correct" -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::incorrect_passwords_do_not_authenticate.dev.1 PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1 ./test.py --mode=dev -k="not incorrect and not passwords_are_salted" -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1 ./test.py --mode=dev --skip=incorrect --skip=passwords_are_salted -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1 ./test.py --mode=dev -k="correct and not incorrect" -vv test/boost/auth_passwords_test.cc PASSED test/boost/auth_passwords_test.cc::correct_passwords_authenticate.dev.1	2025-08-14 14:45:40 +02:00
Artsiom Mishuta	d589f36645	test.py: small refactoring of how boost test arguments make During the migration of boost tests to pytest, a big portion of the logic was used "as is", with bad code and bugs. This PR refactors the function that makes the arguments for the pytest command: 1) refactor how modes are provided 2) refactor how --skip is provided 3) remove the shlex.split workaround	2025-08-14 14:45:28 +02:00
Abhinav Jha	a0ee5e4b85	raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet loss named name_drops. The framework makes hard coded assumptions about leader which doesn't hold well in case of packet losses. This short term fix disables the packet drop variant of the specified test. It should be safe to re-enable it once the whole framework is re-worked to remove these hard coded assumptions. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#23816 Closes scylladb/scylladb#25489	2025-08-14 13:15:16 +02:00
Dawid Mędrek	6f1fb7cfb5	db/hints: Add new logs We're adding new logs in just a few places that may however prove important when debugging issues in hinted handoff in the future.	2025-08-14 11:45:24 +02:00
Dawid Mędrek	d7bc9edc6c	db/hints: Adjust log levels Some of the logs could be clogging Scylla's logs, so we demote their level to a lower one. On the other hand, some of the logs would most likely not do that, and they could be useful when debugging -- we promote them to debug level.	2025-08-14 11:45:24 +02:00
Dawid Mędrek	2327d4dfa3	db/hints: Improve logs Before these changes, the logs in hinted handoff often didn't provide crucial information like the identifier of the node that hints were being sent to. Also, some of the logs were misleading and referred to other places in the code than the one where an exception or some other situation really occurred. We modify those logs, extending them by more valuable information and fixing existing issues. What's more, all of the logs in `hint_endpoint_manager` and `hint_sender` follow a consistent format now: ``` <class_name>[<destination host ID>]:<function_name>: <message> ``` This way, we should always have AT LEAST the basic information.	2025-08-14 11:45:04 +02:00
Anna Stuchlik	841ba86609	doc: document support for new z3 instance types This commit adds new z3 instances we now support to the list of GCP instance types. Fixes https://github.com/scylladb/scylladb/issues/25438 Closes scylladb/scylladb#25446	2025-08-14 10:59:45 +02:00
Avi Kivity	66173c06a3	Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy Remove support for generating numerical sstable generation for new sstables. Loading such sstables is still supported but new sstables are always created with a uuid generation. This is possible since: * All live versions (since 5.4 / `f014ccf369`) now support uuid sstable generations. * The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / `6da758d74c`) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`. Fixes #24248 * Enhancement, no backport needed Closes scylladb/scylladb#24512 * github.com:scylladb/scylladb: streaming: stream_blob: use the table sstable_generation_generator replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator sstables: sstable_generation_generator: stop tracking highest generation replica: table: get rid of update_sstables_known_generation sstables: sstable_directory: stop tracking highest_generation replica: distributed_loader: stop tracking highest_generation sstables: sstable_generation: get rid of uuid_identifiers bool class sstables_manager: drop uuid_sstable_identifiers feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set test: cql_query_test: add test_sstable_load_mixed_generation_type test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils test: database_test: move table_dir helper to test/lib/test_utils	2025-08-14 11:54:33 +03:00
Anna Stuchlik	1e5659ac30	doc: add the information about ScyllaDB C# Driver This commit adds the driver to the list of ScyllaDB drivers, including the information about: - CDC integration (not available) - Tablets (supported) Fixes https://github.com/scylladb/scylladb/issues/25495 Closes scylladb/scylladb#25498	2025-08-14 11:29:52 +03:00
Patryk Jędrzejczak	6ad2b71d04	Merge 'LWT: communicate RPC errors to the user' from Petr Gusev Currently, if the accept or prepare verbs fail on the replica side, the user only receives a generic error message of the form "something went wrong for this table", which provides no insight into the root cause. Additionally, these error messages are not logged by default, requiring the user to restart the node with trace or debug logging to investigate the issue. This PR improves error handling for the accept and prepare verbs by preserving and propagating the original error messages, making it easier to diagnose failures. backport: not needed, not a bug Closes scylladb/scylladb#25318 * https://github.com/scylladb/scylladb: test_tablets_lwt: add test_error_message_for_timeout_due_to_uncertainty storage_proxy: preserve accept error messages storage_proxy: preserve prepare error message storage_proxy: fix log message exceptions.hh: fix message argument passing exceptions: add constructors that accept explicit error messages	2025-08-14 10:23:32 +02:00
Nadav Har'El	2d3c0eb25a	test/alternator: speed up test_ttl_expiration_lsi_key The Alternator test test_ttl.py::test_ttl_expiration_lsi_key is currently the second-slowest test/alternator test, running a "whopping" 2.6 seconds (the total of two parameterizations - with vnodes and tables). This patch reduces it to 0.9 seconds. The fix is simple: Unfortunately, tests that need to wait for actual TTL expiration take time, but the test framework configures the TTL scanner to have a period of half a second, so the wait should be on average around 0.25 seconds. But the test code by mistake slept 1.2 seconds between retries. We even had a good "sleep" variable for the amount of time we should sleep between retries, but forgot to use it. So after lowering the sleep between retries, this test is still not instantaneous - it still needs to wait up to 0.5 seconds for the expirations to occur - but it's almost 3 times faster than before. While working on this test, I also used the opportunity to update its comment which excused why we are testing LSI and not GSI. Its suggestions of what is planned for GSI have already become a reality, so let's update the comment to say so. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25386	2025-08-14 11:21:52 +03:00
Pavel Emelyanov	eaec7c9b2e	Merge 'cql3: add default replication strategy to `create_keyspace_statement`' from Dario Mirovic When creating a new keyspace, both replication strategy and replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };` This syntax is verbose, and in all but some testing scenarios `NetworkTopologyStrategy` is used. This patch allows skipping replication strategy name, filling it with `NetworkTopologyStrategy` when that happens. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };` and will give the same result as the previous, more explicit one. Fixes https://github.com/scylladb/scylladb/issues/16029 Backport is not needed. This is an enhancement for future releases. Closes scylladb/scylladb#25236 * github.com:scylladb/scylladb: docs/cql: update documentation for default replication strategy test/cqlpy: add keyspace creation default strategy test cql3: add default replication strategy to `create_keyspace_statement`	2025-08-14 11:18:36 +03:00
Andrzej Jackowski	bf8be01086	test: audit: add logging of get_audit_log_list and set_of_rows_before Without those logs, analysing some test failures is difficult. Refs: scylladb/scylladb#25442 Closes scylladb/scylladb#25485	2025-08-14 09:53:05 +03:00
Ernest Zaslavsky	dd51e50f60	s3_client: add memory fallback in `chunked_download_source` Introduce fallback logic in `chunked_download_source` to handle memory exhaustion. When memory is low, feed the `deque` with only one uncounted buffer at a time. This allows slow but steady progress without getting stuck on the memory semaphore. Fixes: https://github.com/scylladb/scylladb/issues/25453 Fixes: https://github.com/scylladb/scylladb/issues/25262 Closes scylladb/scylladb#25452	2025-08-14 09:52:10 +03:00
Michał Chojnowski	72818a98e0	tests/lib: extract nondeterministic_choice_stack to test/lib This util will be used in another test file in a later commit, so hoist it to `test/lib`.	2025-08-14 02:06:34 +02:00
Michał Chojnowski	0ffe336887	sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file In a later commit, this concept will be used in a place that's not dependent on trie traversal routines. So extract it to its own header.	2025-08-14 02:06:34 +02:00
Michał Chojnowski	30dad06c9a	sstables/mx: move clustering_info from writer.cc to types.hh We will use this type as the input to the BTI row index writer. Since it will be implemented in other translation units, the definition of the type has to be moved to a header.	2025-08-14 02:06:33 +02:00
Michał Chojnowski	347e5c534a	sstables/trie: allow `comparable_bytes_iterator` to return a mutable span `comparable_bytes_iterator` is a concept for iterating over the fragments of a key translated to BTI encoding. In `trie_traversal.hh`, those fragments are `std::span<const std::byte>`, because the traversal routines have no use for modifying the fragments. But in a later commit we will also have to deal with encoded keys during row index writes, and the row index writer will want to modify the bytes, to nudge the mismatch byte by one in order to obtain a key separator. Let's extend this concept to allow both span<const byte> and span<byte>, so that it can be used in both situations.	2025-08-14 01:54:57 +02:00
Michał Chojnowski	4fb841346b	dht/ring_position: add ring_position_view::weight() This will be useful for the translation of ring positions to BTI encoding. We will use it in a later commit.	2025-08-14 01:54:57 +02:00
Amnon Heiman	a46671ac59	alternator/test_returnconsumedcapacity.py: Test forced read before write This patch introduces two test cases to validate the effect of `alternator_force_read_before_write` on WCU (Write Capacity Unit) tracking: - The first test verifies that when a smaller item replaces a larger one, the WCU reflects the size of the larger item, as expected when read-before-write is enabled. - The second test covers `BatchWriteItem` with `alternator_force_read_before_write` enabled, ensuring that WCU is computed using the actual size of existing items.	2025-08-13 18:04:03 +03:00
Amnon Heiman	ffc7171a5f	alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write Alternator's internal model differs from DynamoDB, which can result in lower WCU (Write Capacity Unit) estimates for some operations. While this is often acceptable, accurate WCU tracking is occasionally required. This patch enables a DynamoDB like WCU calculation for `BatchWriteItem` when the `alternator_force_read_before_write` configuration is enabled, by reading existing items before applying changes. WCU calculation in `BatchWriteItem` is now performed in three stages: 1. During the initial scan, no calculation is done, just the pointers are collected. 2. If read-before-write is enabled, each item is read again and its prior size is compared to the new value. WCU is based on the larger of the two. The updated size is stored in the mutation building array. 3. Regardless of read-before-write, metrics are updated and consumed capacity units are returned if requested. This is done in a loop before sending the mutations. For performance, reads in `BatchWriteItem` use consistency level LOCAL_ONE. These changes increase WCU DynamoDB compatibility in batch operations, but add overhead and should be enabled only when needed. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-08-13 18:03:55 +03:00
Wojciech Mitros	2ece08ba43	test: run mv tests depending on metrics on a standalone instance The test_base_partition_deletion_with_metrics test case (and the batch variant) uses the metric of view updates done during its runtime to check if we didn't perform too many of them. The test runs in the cqlpy suite, which runs all test cases sequentially on one Scylla instance. Because of this, if another test case starts a process which generates view updates and doesn't wait for it to finish before it exits, we may observe too many view updates in test_base_partition_deletion_with_metrics and fail the test. In all test cases we make sure that all tables that were created during the test are dropped at the end. However, that doesn't stop the view building process immediately, so the issue can happen even if we drop the view. I confirmed it by adding a test just before test_base_partition_deletion_with_metrics which builds a big materialized view and drops it at the end - the metrics check still failed. The issue could be caused by any of the existing test cases where we create a view and don't wait for it to be built. Note that even if we start adding rows after creating the view, some of them may still be included in the view building, as the view building process is started asynchronously. In such a scenario, the view building also doesn't cause any issues with the data in these tests - writes performed after view creation generate view updates synchronously when they're local (and we're running a single Scylla server), so the corresponding view updates generated during view building are redundant. Because we have many test cases which could be causing this issue, instead of waiting for the view building to finish in every single one of them, we move the susceptible test cases to be run on separate Scylla instances, in the "cluster" suite. There, no other test cases will influence the results. Fixes https://github.com/scylladb/scylladb/issues/20379 Closes scylladb/scylladb#25209	2025-08-13 15:08:50 +03:00
Petr Gusev	3f287275b8	test_tablets_lwt: add test_error_message_for_timeout_due_to_uncertainty	2025-08-13 14:03:57 +02:00
Petr Gusev	8bd936b72c	storage_proxy: preserve accept error messages	2025-08-13 13:43:12 +02:00
Petr Gusev	00c25d396f	storage_proxy: preserve prepare error message	2025-08-13 13:43:12 +02:00
Petr Gusev	0724fafe47	storage_proxy: fix log message	2025-08-13 13:40:09 +02:00
Petr Gusev	ffaee20b62	exceptions.hh: fix message argument passing The message argument is usually taken from a temporary variable constructed with the format() function. It is more efficient to pass it by value and move it along the constructor chain.	2025-08-13 13:39:52 +02:00
Benny Halevy	50abeb1270	locator: util: optimize describe_ring This change includes basic optimizations to locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop. This yields an improvement of 20% in cpu time. With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took Before: 30 milliseconds (2.6 microseconds per range) After: 24 milliseconds (2.1 microseconds per range) Add respective unit test of describe_ring for tablets. A unit test for vnodes already exists in test/nodetool/test_describering.py Fixes #24887 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:42:25 +03:00
Benny Halevy	60d2cc886a	locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas First, let get_all_ranges return all vnode ranges with a corrected wrapping range covering the [last token, first token) range, such that all ranges' start tokens are vnode tokens and must be in the vnode replication map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:42:23 +03:00
Benny Halevy	195d02d64e	vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map Prevent a crash, especially in the is_vnode=true case, if the key_token is not found in the map. Rather than the undefined behavior when dereferencing the end() iterator, throw an internal error with additional logging about the search logic and parameters. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:41:03 +03:00
Benny Halevy	4d646636f2	locator: effective_replication_map: get_natural_replicas: get is_vnode param Some callers, like `construct_range_to_endpoint_map` for describe_ring, or `get_secondary_ranges` for alternator ttl pass vnode tokens (the vnodes' start token), and therefore can benefit from the fast lookup path in `vnode_effective_replication_map::do_get_replicas`. Otherwise the vnode token is binary-searched in sorted_tokens using token_metadata::first_token(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:41:00 +03:00
Benny Halevy	f22a870a04	test: cluster: test_repair: add test_vnode_keyspace_describe_ring Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-13 12:39:40 +03:00
Yaniv Michael Kaul	b75799c21c	skip instead of xfail test_change_replication_factor_1_to_0 It's a waste of good machine time to xfail this rather than just skip. It takes >3m just to run the test and xfail. We have a marker for it, we know why we skip it. Fixes: https://github.com/scylladb/scylladb/issues/25310 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#25311	2025-08-13 10:32:22 +02:00
Ernest Zaslavsky	380c73ca03	s3_client: make memory semaphore acquisition abortable Add `abort_source` to the `get_units` call for the memory semaphore in the S3 client, allowing the acquisition process to be aborted. Fixes: https://github.com/scylladb/scylladb/issues/25454 Closes scylladb/scylladb#25469	2025-08-13 08:48:55 +03:00
Jenkins Promoter	2de91d43d5	Update pgo profiles - x86_64	2025-08-13 07:52:17 +03:00
Jenkins Promoter	647d9fe45d	Update pgo profiles - aarch64	2025-08-13 07:43:38 +03:00
Dario Mirovic	2ac37b4fde	docs/cql: update documentation for default replication strategy Update create-keyspace-statement section of ddl.rst since `class` is no longer mandatory. Add an example for keyspace creation without specifying `class`. Refs: #16029	2025-08-13 01:52:00 +02:00
Dario Mirovic	ef63d343ba	test/cqlpy: add keyspace creation default strategy test Add a test case for create keyspace default replication strategy. It is expected that the default replication strategy is `NetworkTopologyStrategy`. Refs: #16029	2025-08-13 01:52:00 +02:00
Dario Mirovic	bc8bb0873d	cql3: add default replication strategy to `create_keyspace_statement` When creating a new keyspace, both replication strategy and replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };` This syntax is verbose, and in all but some testing scenarios `NetworkTopologyStrategy` is used. This patch allows skipping replication strategy name, filling it with `NetworkTopologyStrategy` when that happens. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };` and will give the same result as the previous, more explicit one. Fixes #16029	2025-08-13 01:51:53 +02:00
Botond Dénes	72b2bbac4f	pgo/pgo.py: use tablet repair API for repair Since `a1d7722` tablet keyspaces are not allowed to be repaired via the old /storage_service/repair_async/{keyspace} API, instead the new /storage_service/tablets/repair API has to be used. Adjust the repair code and also add await_completion=true: the script just waits for the repair to finish immediately after starting it. Closes scylladb/scylladb#25455	2025-08-12 20:32:19 +03:00
Petr Gusev	ff89c03c7f	exceptions: add constructors that accept explicit error messages To improve debuggability, we need to propagate original error messages from Paxos verbs to the user. This change adds constructors that take an error message directly, enabling better error reporting. Additionally, functions such as write_timeout_to_read, write_failure_to_read etc are updated to use these message-based constructors. These functions are used in storage_proxy::cas to convert between different error types, and without this change, they could lose the original error message during conversion.	2025-08-12 16:31:05 +02:00
Taras Veretilnyk	b7097b2993	database_test: fix abandoned futures in test_drop_quarantined_sstables The lambda passed to do_with_cql_env_thread() in test_drop_quarantined_sstables was mistakenly written as a coroutine. This change replaces co_await with .get() calls on futures and changes lambda return type to void. Fixes scylladb/scylladb#25427 Closes scylladb/scylladb#25431	2025-08-12 13:31:06 +03:00
Patryk Jędrzejczak	a1b2f99dee	Merge 'test: test_mv_backlog: fix to consider internal writes' from Michael Litvak The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics. The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match. The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time. Fixes https://github.com/scylladb/scylladb/issues/23139 backport to improve CI stability - test only change Closes scylladb/scylladb#25279 * https://github.com/scylladb/scylladb: test: test_mv_backlog: fix to consider internal writes test/pylib/rest_client: fix ScyllaMetrics filtering	2025-08-12 10:05:15 +02:00
Wojciech Przytuła	7600ccfb20	Fix link to ScyllaDB manual The link would point to outdated OS docs. I fixed it to point to up-to-date Enterprise docs. Closes scylladb/scylladb#25328	2025-08-12 10:33:06 +03:00
Avi Kivity	ac1f6aa0de	auth: resource: simplify some range transformations Supply the member function directly to std::views::transform, rather than going through a lambda. Closes scylladb/scylladb#25419	2025-08-12 10:30:06 +03:00
Karol Nowacki	22a133df9b	service/vector_store_client: Add live configuration update support Enable runtime updates of vector_store_uri configuration without requiring server restart. This allows to dynamically enable, disable, or switch the vector search node endpoint on the fly.	2025-08-12 08:12:53 +02:00
Karol Nowacki	152274735e	test/boost/vector_store_client_test.cc: Refactor vector store client test Consolidate consecutive setup functions into a dedicated helper. Extract test table creation into a separate function. Remove redundant assertions to improve clarity.	2025-08-12 08:12:53 +02:00
Karol Nowacki	858c423501	service/vector_store_client: Refactor host_port struct created This new struct groups the host and port.	2025-08-12 08:12:53 +02:00
Karol Nowacki	dd147cd8e5	service/vector_store_client: Refactor HTTP request creation Introduce lightweight wrapper for seastar::http::experimental::client This wrapper simplifies request creation by automatically injecting the host name.	2025-08-12 08:12:53 +02:00
Tomasz Grabiec	9fd312d157	Merge 'row_cache: add memtable overlap checks elision optimization for tombstone gc' from Botond Dénes https://github.com/scylladb/scylladb/issues/24962 introduced memtable overlap checks to cache tombstone GC. This was observed to be very strict and greatly reduce the effectiveness of tombstone GC in the cache, especially for MV workloads, which regularly recycle old timestamp into new writes, so the memtable often has smaller min live timestamp than the timestamp of the tombstones in the cache. When creating a new memtable, save a snapshot of the tombstone gc state. This snapshot is used later to exclude this memtable from overlap checks for tombstones, whose token have an expiry time larger than that of the tombstone, meaning: all writes in this memtable were produced at a point in time when the current tombstone has already expired. This has the following implications: * The partition the tombstone is part of was already repaired at the time the memtable was created. * All writes in the memtable were produced after this tombstone's expiry time, these writes cannot be possibly relevant for this tombstone. Based on this, such memtables are excluded from the overlap checks. With adequately frequent memtable flushes -- so that the tombstone gc state snapshot is refreshed -- most memtables should be excluded from overlap checks, greatly helping the cache's tombstone GC efficiency. 
Fixes: https://github.com/scylladb/scylladb/issues/24962 Fixes a regression introduced by https://github.com/scylladb/scylladb/pull/23255 which was backported to all releases, needs backport to all releases as well Closes scylladb/scylladb#25033 * github.com:scylladb/scylladb: docs/dev/tombstone.md: document the memtable overlap check elision optimization test/boost/row_cache_test: add test for memtable overlap check elision db/cache_mutation_reader: obtain gc-before and min-live-ts lazily mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result db/cache_mutation_reader: use max_purgeable::can_purge() replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine() replica/database: memtable_list::get_max_purgeable(): set expiry-treshold compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold replica/table: propagate gc_state to memtable_list replica/memtable_list: add tombstone_gc_state* member replica/memtable: add tombstone_gc_state_snapshot tombstone_gc: introduce tombstone_gc_state_snapshot tombstone_gc: extract shared state into shared_tombstone_gc_state tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value tombstone_gc: fold get_group0_gc_time() into its caller tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time() tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time() tombstone_gc: refactor get_or_greate_repair_history_for_table() replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/ db/read_context: return max_purgeable from get_max_purgeable() compaction/compaction_garbage_collector: add formatter for max_purgeable mutation: move definition of gc symbols to compaction.cc compaction/compaction_garbage_collector: refactor max_purgeable into a class test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable test: rewrite 
test_compacting_reader_tombstone_gc_with_data_in_memtable in C++ test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests	2025-08-11 23:54:59 +02:00
Michał Chojnowski	3017dbb204	sstables/trie: add trie traversal routines `trie::node_reader`, added in a previous series, contains encoding-aware logic for traversing a single node (or a batch of nodes) during a trie search. This commit adds encoding-agnostic functions which drive the `trie::node_reader` in a loop to traverse the whole branch. Together, the added functions (`traverse`, `step`, `step_back`) and the data structure they modify (`ancestor_trail`) constitute a trie cursor. We might later wrap them into some `trie_cursor` class, but regardless of whether we are going to do that, keeping them (also) as free functions makes them easier to test. Closes scylladb/scylladb#25396	2025-08-11 19:15:09 +03:00
Botond Dénes	660ea9202a	docs/dev/tombstone.md: document the memtable overlap check elision optimization	2025-08-11 17:20:12 +03:00
Botond Dénes	65c770f21a	test/boost/row_cache_test: add test for memtable overlap check elision	2025-08-11 17:20:12 +03:00
Botond Dénes	7adbb1bd17	db/cache_mutation_reader: obtain gc-before and min-live-ts lazily Obtaining the gc-before time, or the min-live timestamps (with the expiry threshold) is not always trivial, so defer it until we know it is needed. Not all reads will attempt to garbage-collect tombstones, these reads can now avoid this work. The downside is that the partition key has to be copied and stored, as it is necessary for obtaining the min-live timestamp later.	2025-08-11 17:20:12 +03:00
Botond Dénes	f4b0c384fb	mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result Use the optimized can_purge() check instead of the old stricter direct timestamp comparison method.	2025-08-11 17:20:12 +03:00
Botond Dénes	92e8d2f9b2	db/cache_mutation_reader: use max_purgeable::can_purge() Use the optimized can_purge() check instead of the old stricter direct timestamp comparison method.	2025-08-11 17:20:12 +03:00
Botond Dénes	4e15d32151	replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine() To combine the max purgeable values, instead of just combining the timestamp values. The former way is still correct, but loses the timestamp explosion optimization, which allows the cache reader to drop timestamps from the overlap checks.	2025-08-11 17:20:12 +03:00
Botond Dénes	bd32d41cad	replica/database: memtable_list::get_max_purgeable(): set expiry-treshold Use the newly introduced expiry_treshold field of max_purgeable, to help exclude memtables from the overlap check if possible.	2025-08-11 17:20:12 +03:00
Botond Dénes	cfac9691ff	compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold Allow possibly avoiding overlap checks in the case where the source of the min-live timestamp is known to only contain data which was written after the expiry threshold. The expiry threshold is the upper bound of tombstone.deletion_time that was already expired at the time of obtaining this expiry threshold value. This means that any write originating after this point in time was generated at a time when such a tombstone was already expired. Hence these writes are not relevant for the purposes of overlap checks with the tombstone and so their min-live timestamp can be ignored. This is important for MV workloads, where writes generated now can have timestamps going far back in time, possibly blocking tombstone GC of much older [shadowable] tombstones.	2025-08-11 17:20:11 +03:00
Patryk Jędrzejczak	e14c5e3890	Merge 'raft: enforce odd number of voters in group0' from Emil Maskovsky raft: enforce odd number of voters in group0 Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and an odd number of voters is preferred because an even number doesn't add additional reliability but introduces the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network partition, neither group will have a majority and both will be unable to commit new entries. With an odd number of voters, such equal partition scenarios are impossible (unless the network is partitioned into at least three groups). Fixes: scylladb/scylladb#23266 No backport: This is a new change that is to be only deployed in the new version, so it will not be backported. Closes scylladb/scylladb#25332 * https://github.com/scylladb/scylladb: raft: enforce odd number of voters in group0 test/raft: adapt test_tablets_lwt.py for odd voter number enforcement test/raft: adapt test_raft_no_quorum.py for odd voter enforcement	2025-08-11 15:44:21 +02:00
Benny Halevy	23ac80fc6b	utils: stall_free: detect clear_gently method of const payload types Currently, when a container or smart pointer holds a const payload type, utils::clear_gently does not detect the object's clear_gently method as the method is non-const and requires a mutable object, as in the following example in class tablet_metadata: ``` using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>; using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>; ``` That said, when a container is cleared gently the elements it holds are destroyed anyhow, so we'd like to allow clearing them gently before destruction. This change still doesn't allow directly calling utils::clear_gently on const objects. Add respective unit tests. Fixes #24605 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-11 14:22:01 +03:00
Benny Halevy	cb9db2f396	utils: stall_free: clear gently a foreign shared ptr only when use_count==1 Unlike clear_gently of SharedPtr, clear_gently of a `foreign_ptr<shared_ptr<T>>` calls clear_gently on the contained object even if it's still shared and may still be in use. This change examines the foreign shared pointer's use_count and calls clear_gently on the shared object only when its use_count reaches 1. Fixes #25026 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-11 14:21:32 +03:00
Tomasz Grabiec	f7c001deff	Merge 'key: clustering_bounds_comparator: avoid thread_local initialization guard overhead' from Avi Kivity I noticed clustering_bounds_comparator was running an unnecessary thread_local initialization guard. This series switches the variable to constinit initialization, removing the guard. Performance measurements (perf-simple-query) show an unimpressive 20 instruction per op reduction. However, each instruction counts! Before: ``` throughput: mean= 203642.54 standard-deviation=1102.99 median= 204328.69 median-absolute-deviation=955.56 maximum=204624.13 minimum=202222.19 instructions_per_op: mean= 42097.59 standard-deviation=40.07 median= 42111.83 median-absolute-deviation=30.65 maximum=42139.88 minimum=42044.91 cpu_cycles_per_op: mean= 22664.81 standard-deviation=131.28 median= 22581.10 median-absolute-deviation=111.57 maximum=22832.30 minimum=22553.24 ``` After: ``` throughput: mean= 204397.73 standard-deviation=2277.71 median= 204942.95 median-absolute-deviation=2191.54 maximum=207588.30 minimum=202162.80 instructions_per_op: mean= 42087.21 standard-deviation=27.30 median= 42092.75 median-absolute-deviation=20.33 maximum=42108.33 minimum=42041.51 cpu_cycles_per_op: mean= 22589.79 standard-deviation=219.24 median= 22544.82 median-absolute-deviation=191.98 maximum=22835.11 minimum=22303.52 ``` (Very) minor performance improvement, no backport suggested. Closes scylladb/scylladb#25259 * github.com:scylladb/scylladb: keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit keys: make empty creation clustering_key_prefix constexpr managed_bytes: make empty managed_bytes constexpr friendly keys: clustering_bounds_comparator: make _empty_prefix a prefix	2025-08-11 13:20:38 +02:00
Anna Stuchlik	1322f301f6	doc: add support for RHEL 10 This commit adds RHEL 10 to the list of supported platforms. Fixes https://github.com/scylladb/scylladb/issues/25436 Closes scylladb/scylladb#25437	2025-08-11 13:13:37 +02:00
Israel Fruchter	2da26d1fc1	Update tools/cqlsh submodule (v6.0.26) * tools/cqlsh 02ec7c57...aa1a52c1 (6): > build-push.yaml: upgrade cibuildwheel to latest > build-push.yml: skip python 3.8 and PyPy builds > cqlshlib: make NetworkTopologyStrategy default for autocomplete > default to setuptools_scm based version when not packaged > chore(deps): update pypa/cibuildwheel action to v2.23.0 Closes scylladb/scylladb#25420	2025-08-11 13:07:47 +03:00
Artsiom Mishuta	dac04a5b97	fix(test.py) incorrect markers argument in boost tests The pytest markers argument can be space-separated, like "not unstable". To pass such an argument properly in a CLI (bash) command, we should use double quotes, because shlex.split is used with space separation. While we do not support markers in C++ tests, we are passing all pytest arguments. Tested locally with the command: ./tools/toolchain/dbuild ./test.py --markers="not unstable" test/boost/auth_passwords_test.cc Before the change: no tests ran in 1.12s After: 8 passed in 2.45s Closes scylladb/scylladb#25394	2025-08-11 10:43:34 +03:00
Patryk Jędrzejczak	7b77c6cc4a	docs: Raft recovery procedure: recommend verifying participation in Raft recovery This instruction adds additional safety. The faster we notice that a node didn't restart properly, the better. The old gossip-based recovery procedure had a similar recommendation to verify that each restarting node entered `RECOVERY` mode. Fixes #25375 This is a documentation improvement. We should backport it to all branches with the new recovery procedure, so 2025.2 and 2025.3. Closes scylladb/scylladb#25376	2025-08-11 09:21:29 +03:00
Avi Kivity	f49b63f696	tools: toolchain: dbuild: forward container registry credentials Docker hub rate-limits unauthenticated image pulls, so forward the host's credentials to the container. This prevents rate limit errors when running nested containers. Try the locations for the credentials in order and bind-mount the first that exists to a location that gets picked up. Verified with `podman login --get-login docker.io` in the container. Closes scylladb/scylladb#25354	2025-08-11 09:05:57 +03:00
Botond Dénes	3b1f414fcf	replica/table: propagate gc_state to memtable_list	2025-08-11 07:09:19 +03:00
Botond Dénes	9d00d7e08d	replica/memtable_list: add tombstone_gc_state* member To be passed down to the memtable.	2025-08-11 07:09:19 +03:00
Botond Dénes	ef8a21b4cf	replica/memtable: add tombstone_gc_state_snapshot To be used for possibly excluding the memtable from overlap checks with the cache/sstables, in memtable_list::get_max_purgeable().	2025-08-11 07:09:19 +03:00
Botond Dénes	ab633590f1	tombstone_gc: introduce tombstone_gc_state_snapshot Returns gc-before times, identical to what tombstone_gc_state would have returned at the point of taking the snapshot.	2025-08-11 07:09:14 +03:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map. Move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Botond Dénes	3a54379330	tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value No reason for it to be a shared pointer, or even a pointer at all. When the pointer is not initialized, gc_clock::time_point::min() is used as the group0 gc time, so we can just replace with a gc_clock::time_point value initialized to min() and do away with an unnecessary indirection as well as an allocation. This latter will be even more important after the next patches.	2025-08-11 07:09:14 +03:00
Botond Dénes	aa43396aac	tombstone_gc: fold get_group0_gc_time() into its caller It has just one caller. This fold makes the code simpler and facilitates further patching.	2025-08-11 07:09:14 +03:00
Botond Dénes	faa2b5b4d4	tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time() Its only caller. Makes the code simpler and facilitates further patching.	2025-08-11 07:09:13 +03:00
Botond Dénes	e9d211bbcd	tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time() Its only caller. Makes the code simpler and facilitates further patching.	2025-08-11 07:09:13 +03:00
Botond Dénes	b9f0cabead	tombstone_gc: refactor get_or_create_repair_history_for_table() This method has 3 lookups into the reconcile history maps in the worst case. Reduce to just one. Makes the code more streamlined and prepares the groundwork for the next patch.	2025-08-11 07:09:13 +03:00
Botond Dénes	1d3a3163a3	replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/ Also change the return type to max_purgeable, instead of a raw timestamp. Prepares for further patching of this code.	2025-08-11 07:09:13 +03:00
Botond Dénes	5d69ef5e8b	db/read_context: return max_purgeable from get_max_purgeable() Instead of just the timestamp. Soon more fields will be used.	2025-08-11 07:09:13 +03:00
Botond Dénes	1d2cc6ef12	compaction/compaction_garbage_collector: add formatter for max_purgeable It is more than just a timestamp already, and it is about to receive some additional fields.	2025-08-11 07:09:13 +03:00
Botond Dénes	6078c15116	mutation: move definition of gc symbols to compaction.cc We are used to symbol definitions being grouped in one .cc file, but a symbol declaration and definition living in separate modules (subfolders) is surprising. Relocate always_gc, never_gc, can_always_purge and can_never_purge to compaction/compaction.cc, from mutation/mutation_partition.cc. The declarations of these symbols are in compaction/compaction_garbage_collector.hh.	2025-08-11 07:09:13 +03:00
Botond Dénes	ef7d49cd21	compaction/compaction_garbage_collector: refactor max_purgeable into a class Make members private, add getters and constructors. This struct will get more functionality soon, so class is a better fit.	2025-08-11 07:09:13 +03:00
Botond Dénes	c150bdd59c	test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable This test currently uses gc_grace_seconds=0. The introduction of memtable overlap elision will break these tests because the optimization is always active with this tombstone-gc. Switch the tests to use tombstone-gc=repair, which allows for greater control over when the memtable overlap elision is triggered. This requires a move to vnodes, as tombstone-gc=repair doesn't work with RF=1 currently, and using RF=3 won't work with tablets.	2025-08-11 07:09:13 +03:00
Botond Dénes	c052f2ad1d	test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++ This test will soon need to be changed to use tombstone-gc=repair. This cannot work as of now, as the test uses a single-node cluster. The options are the following: * Make it use more than one node * Make repair work with single node clusters * Rewrite in C++ where repair can be done synthetically We chose the last option, it is the simplest one both in terms of code and runtime footprint. The new test is in test/boost/row_cache_test.cc Two changes were made during the migration: * Change the name to test_populating_reader_tombstone_gc_with_data_in_memtable to better express which cache component this test is targeting; * Use NullCompactionStrategy on the table instead of disabling auto-compaction.	2025-08-11 07:09:13 +03:00
Botond Dénes	e4c048ada1	test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests These tests currently use tombstone-gc=immediate. The introduction of memtable overlap elision will break these tests because the optimization is always active with this tombstone-gc. Switch the tests to use tombstone-gc=repair, which allows for greater control over when the memtable overlap elision is triggered. This requires a move to vnodes, as tombstone-gc=repair doesn't work with RF=1 currently, and using RF=3 won't work with tablets.	2025-08-11 07:09:13 +03:00
Asias He	2ecd42f369	feature_service: Add TABLET_INCREMENTAL_REPAIR feature	2025-08-11 10:10:08 +08:00
Asias He	b226ad2f11	tablet_allocator: Add tablet_force_tablet_count_increase and decrease It is useful to increase and decrease the tablet count in the test for tablet split and merge testing.	2025-08-11 10:10:08 +08:00
Asias He	1bf59ebba0	repair: Add incremental helpers This adds the helpers which are needed by both repair and compaction to add incremental repair support.	2025-08-11 10:10:08 +08:00
Asias He	b86f554760	sstable: Add being_repaired to sstable This in-memory field is set by incremental repair when the sstable participates in the repair.	2025-08-11 10:10:08 +08:00
Asias He	f50cd94429	sstables: Add set_repaired_at to metadata_collector	2025-08-11 10:10:08 +08:00
Asias He	ac9d33800a	mutation_compactor: Introduce add operator to compaction_stats It is needed to combine two compactions.	2025-08-11 10:10:07 +08:00
Asias He	5377f87e5a	tablet: Add sstables_repaired_at to system.tablets table It is used to store the repaired_at for each tablet.	2025-08-11 10:10:07 +08:00
Asias He	8db18ac74e	test: Fix drain api in task_manager_client.py The POST method should be used.	2025-08-11 10:10:07 +08:00
Avi Kivity	6daa6178b1	scripts: pull_github_pr.sh: reject unintended submodule changes It is easy for submodule changes to slip through during rebase (if the developer uses the terrible `git add -u` command) and for a maintainer to miss it (if they don't go over each change after a rebase). Protect against such mishaps by checking if a submodule was updated (or .gitmodules itself was changed) and aborting the operation. If the pull request title contains "submodule", assume the operation was intended. Allow bypassing the check with --allow-submodule. Closes scylladb/scylladb#25418	2025-08-10 11:48:34 +03:00
Michael Litvak	276a09ac6e	test: test_mv_backlog: fix to consider internal writes The test executes a single write, fetching metrics before and after the write, and expects the total throttled writes count to be increased exactly by one. However, other internal writes (compaction for example) may be executed during this time and be throttled, causing the metrics to be increased by more than expected. To address this, we filter the metrics by the scheduling group label of the user write, to filter out the compaction writes that run in the compaction scheduling group. Fixes scylladb/scylladb#23139	2025-08-10 10:31:02 +02:00
Michael Litvak	5c28cffdb4	test/pylib/rest_client: fix ScyllaMetrics filtering In the ScyllaMetrics `get` function, when requesting the value for a specific shard, it is expected to return the sum of all values of metrics for that shard that match the labels. However, it would return the value of the first matching line it finds instead of summing all matching lines. For example, if we have two lines for one shard like: some_metric{scheduling_group_name="compaction",shard="0"} 1 some_metric{scheduling_group_name="sl:default",shard="0"} 2 The result of this call would be 1 instead of 3: get('some_metric', shard="0") We fix this to sum all matching lines. The filtering of lines by labels is fixed to allow specifying only some of the labels. Previously, for the line to match the filter, either the filter needs to be empty, or all the labels in the metric line had to be specified in the filter parameter and match its value, which is unexpected, and breaks when more labels are added. We also simplify the function signature and the implementation - instead of having the shard as a separate parameter, it can be specified as a label, like any other label.	2025-08-10 10:16:00 +02:00
Avi Kivity	c2a2e11c40	Merge 'Prepare the way for incremental repair' from Botond Dénes With incremental repair, each replica::compaction_group will have 3 logical compaction groups: repaired, repairing and unrepaired. The definition of a group is a set of sstables that can be compacted together. The logical groups will share the same instance of sstable_set, but each will have its own logical sstable set. Existing compaction::table_state is a view for a logical compaction group. So it makes sense that each replica::compaction_group will have multiple views. Each view will provide to the compaction layer only the sstables that belong to it. That way, we preserve the existing interface between replica and compaction layer, where each compaction::table_state represents a single logical group. The idea is that all the incremental repair knowledge is confined to the repair and replica layers; compaction doesn't want to know about it, it just works on logical groups, and what each represents doesn't matter from the perspective of the subsystem. This is the best way forward to not violate layers and reduce the maintenance burden in the long run. We also proceed to rename table_state to compaction_group_view, since it's a better description. Working with multiple terms is confusing. The placeholder for implementing the sstable classifier is also left in tablet_storage_group_manager; for the time being, all sstables will go to the unrepaired logical set, which preserves the current behavior. 
New functionality, no backport required Closes scylladb/scylladb#25287 * github.com:scylladb/scylladb: test: Add test that compaction doesn't cross logical group boundary replica: Introduce views in compaction_group for incremental repair compaction: Allow view to be added with compaction disabled replica: Futurize retrieval of sstable sets in compaction_group_view treewide: Futurize estimation of pending compaction tasks replica: Allow compaction_group to have more than one view Move backlog tracker to replica::compaction_group treewide: Rename table_state to compaction_group_view tests: adjust for incremental repair	2025-08-09 17:21:17 +03:00
Emil Maskovsky	7c54401d3d	raft: enforce odd number of voters in group0 Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and an odd number of voters is preferred because an even number doesn't add additional reliability but introduces the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network partition, neither group will have a majority and both will be unable to commit new entries. With an odd number of voters, such equal partition scenarios are impossible (unless the network is partitioned into at least three groups). Fixes: scylladb/scylladb#23266	2025-08-08 19:49:20 +02:00
Emil Maskovsky	29ddb2aa18	test/raft: adapt test_tablets_lwt.py for odd voter number enforcement The test_lwt_timeout_while_creating_paxos_state_table was failing after implementing odd number voter enforcement in the group0 voter calculator. Previously with 2 nodes: - 2 nodes → 2 voters → stop 1 node → 1/2 voters (no quorum) → expected Raft timeout With odd voter count enforcement: - 2 nodes → 1 voter → stop 1 node → 0/1 voters → Cassandra availability error This change updates the test to use 3 nodes instead of 2, ensuring proper no-quorum scenarios: - 3 nodes → 3 voters → stop 2 nodes → 1/3 voters (no quorum) → Raft timeout The test now correctly validates LWT timeout behavior while being compatible with the odd number voter enforcement requirement.	2025-08-08 19:49:10 +02:00
Emil Maskovsky	7fc75aff3e	test/raft: adapt test_raft_no_quorum.py for odd voter enforcement Update the no-quorum cluster tests to work correctly with the new odd number voter enforcement in the group0 voter calculator. The tests now properly account for the changed voter counts when validating no-quorum scenarios.	2025-08-08 19:48:58 +02:00
Anna Stuchlik	f3d9d0c1c7	doc: add new and removed metrics to the 2025.3 upgrade guide This commit adds the list of new and removed metrics to the already existing upgrade guide from 2025.2 to 2025.3. Fixes https://github.com/scylladb/scylladb/issues/24697 Closes scylladb/scylladb#25385	2025-08-08 13:25:51 +02:00
Avi Kivity	ab45a0edb5	Update seastar submodule * seastar 60b2e7da...1520326e (36): > Merge 'http/client: Fix content length body overflow check (and a bit more)' from Pavel Emelyanov test/http: Add test for http_content_length_data_sink test/http: Implement some missing methods for memory data sink http/client: Fix content length body overflow check http/client: Fix misprint in overflow exception message > dns: Use TCP connection data_sink directly > iostream: Update "used stream" check for output_stream::detach() > Update dpdk submodule > rpc: server::process: coroutinize > iostream: Remove deprecated constructor > Merge 'foreign_ptr: add unwrap_on_owner_shard method' from Benny Halevy foreign_ptr: add unwrap_on_owner_shard method foreign_ptr: release: check_shard with SEASTAR_DEBUG_SHARED_PTR > enum: Replace static_assert() with concept > rpc: reindent connection::negotiate() > rpc: client::loop: use structured binding > rpc.cc: reindent > queue: Remove duplicating static assertion > Merge 'rpc: client: convert main loop to a coroutine' from Avi Kivity rpc: client::loop(): restore indentation rpc: client: coroutinize client::loop() rpc: client: split main loop function > Merge 'treewide: replace remaining std::enable_if with constraints' from Avi Kivity optimized_optional: replace std::enable_if with constraint log: replace std::enable_if with constraint rpc: replace std::enable_if with constraint when_all: replace std::enable_if with constraints transfer: replace std::enable_if with constraints sstring: replace std::enable_if with constraint simple-stream: replace std::enable_if with constraints shared_ptr: replace std::enable_if with constraints sharded: replace std::enable_if with constraints for sharded_has_stop sharded: replace std::enable_if with constraints for peering_sharded_service scollectd: replace std::enable_if with constraints for type inference scollectd: replace std::enable_if with constraints for ser/deser metrics: replace std::enable_if with 
constraints chunked_fifo: replace std::enable_if with constraint future: replace std::enable_if with constraints > websocket: Avoid sending scattered_message to output_stream > websocket: Remove unused scattered_message.hh inclusion > aio: Squash aio_nowait_supported into fs_info::nowait_works > Merge 'reactor: coroutinize spawn()' from Avi Kivity reactor: restore indentation for spawn() reactor: coroutinize spawn() > modules: export coroutine facilities > Merge 'reactor: coroutinize some file-related functions' from Avi Kivity reactor: adjust indentation reactor: coroutinize reactor::make_pipe() reactor: coroutinize reactor::inotify_add_watch() reactor: coroutinize reactor::read_directory() reactor: coroutinize reactor::file_type() reactor: coroutinize reactor::chmod() reactor: coroutinize reactor::link_file() reactor: coroutinize reactor::rename_file() reactor: coroutinize open_file_dma() > memory: inline disable_abort_on_alloc_failure_temporarily > Merge 'addr2line timing and optimizations' from Travis Downs addr2line: add basic timing support addr2line: do a quick check for 0x in the line addr2line: don't load entire file addr2line: typing fixing > posix: Replace static_assert with concept > tls: Push iovec with the help of put(vector<temporary_buffer>) > io_queue: Narrow down friendship with reactor > util: drop concepts.hh > reactor: Re-use posix::to_timespec() helper > Fix incorrect defaults for io queue iops/bandwidth > net: functions describing ssl connection > Add label values to the duplicate metrics exception > Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov test: Add unit test for cross-sched-groups wakeups test: Add unit test for fair CPU scheduling test: Add unit test for basic supergrops manipulations test: Add perf test for context switch latency scheduling: Add an internal method to get group's supergroup reactor: Add supergroup get_shares() API reactor: Add supergroup::set_shares() API reactor: Create scheduling groups in 
supergroups reactor: Supergroups destroying API reactor: Supergroups creating API reactor: Pass parent pointer to task_queue from caller reactor: Wakeup queue group on child activation reactor: Add pure virtual sched_entity::run_tasks() method reactor: Make task_queue_group be sched_entity too reactor: Split task_queue_group::run_some_tasks() reactor: Count and limit supergroup children reactor: Link sched entity to its parent reactor: Switch activate(task_queue) to work on sched_entity reactor: Move set_shares() to sched_entity() reactor: Make account_runtime() work with sched_entity reactor: Make insert_activating_task_queue() work on sched_entity reactor: Make pop_active_task_queue() work on sched_entity reactor: Make insert_active_task_queue() work on sched_entity reactor: Move timings to sched_entity reactor: Move active bit to sched_entity reactor: Move shares to sched_entity reactor: Move vruntime to sched_entity reactor: Introduce sched_entity reactor: Rename _activating_task_queues -> _activating reactor: Remove local atq variable reactor: Rename _active_task_queues -> _active reactor: Move account_runtime() to task_queue_group reactor: Move vruntime update from task_queue into _group reactor: Simplify task_queue_group::run_some_tasks() reactor: Move run_some_tasks() into task_queue_group reactor: Move insert_activating_task_queues() into task_queue_group reactor: Move pop_active_task_queue() into task_queue_group reactor: Move insert_active_task_queue() into task_queue_group reactor: Introduce and use task_queue_group::activate(task_queue) reactor: Introduce task_queue_group::active() reactor: Wrap scheduling fields into task_queue_group reactor: Simplify task_queue::activate() reactor: Rename task_queue::activate() -> wakeup() reactor: Make activate() method of class task_queue reactor: Make task_queue::run_tasks() return bool reactor: Simplify task_queue::run_tasks() reactor: Make run_tasks() method of class task_queue > Fix hang in io_queue for big 
write ioproperties numbers > split random io buffer size in 2 options > reactor: document run_in_background > Merge 'Add io_queue unit test for checking request rates' from Robert Bindar Add unit test for validating computed params in io_queue Move `disk_params` and `disk_config_params` to their own unit Add an overload for `disk_config_params::generate_config` Closes scylladb/scylladb#25404	2025-08-08 12:24:39 +03:00
Benny Halevy	49e3b2827f	streaming: stream_blob: use the table sstable_generation_generator No need to start a local generator. Can just use the table's sstable generation generator to make new sstables now that it's stateless and doesn't depend on the highest generation found. Note that tablet_stream_files_handler used uuid generations unconditionally from inception (`4018dc7f0d`). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	de8a199f79	replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator No need to start a local sharded generator. Can just use the table's sstable generation generator to make new sstables now that it's stateless and doesn't depend on the highest generation found (including the uploaded sstables). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	13f4e27cb9	sstables: sstable_generation_generator: stop tracking highest generation It is unused by now. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	0a20834d2a	replica: table: get rid of update_sstables_known_generation It is not needed anymore. With that database::_sstable_generation_generator can be a regular member rather than optional and initialized later. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	42cb25c470	sstables: sstable_directory: stop tracking highest_generation It is not needed anymore as we always generate uuid generations. Convert sstable_directory_test_table_simple_empty_directory_scan to use the newly added empty() method instead of checking the highest generation seen. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	b01524c5a3	replica: distributed_loader: stop tracking highest_generation It is not needed anymore as we always generate uuid generations. Move highest_generation_seen(sharded<sstables::sstable_directory>& directory) to sstables/sstable_directory module. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	6cc964ef16	sstables: sstable_generation: get rid of uuid_identifiers bool class Now that all call sites enable uuid_identifiers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	43ee9c0593	sstables_manager: drop uuid_sstable_identifiers It is returning constant sstables::uuid_identifiers::yes now, so let the callers just use the constant (to be dropped in a following patch). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	0ad1898f0a	feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set The feature is supported by all live versions since version 5.4 / 2024.1. (Although up to `6da758d74c` it could be disabled using the config option) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:15 +03:00
Botond Dénes	70aa81990b	Merge 'Alternator - add the ability to write, not just read, system tables' from Nadav Har'El In commit `44a1daf` we added the ability to read Scylla system tables with Alternator. This feature is useful, among other things, in tests that want to read Scylla's configuration through the system table system.config. But tests often want to modify system.config, e.g., to temporarily reduce some threshold to make tests shorter. Until now, this was not possible. This series adds support for writing to system tables through Alternator, and examples of tests using this capability (and utility functions to make it easy). Because the ability to write to system tables may have non-obvious security consequences, it is turned off by default and needs to be enabled with a new configuration option "alternator_allow_system_table_write". No backports are necessary - this feature is only intended for tests. We may later decide to backport if we want to backport new tests, but I think the probability we'll want to do this is low. Fixes #12348 Closes scylladb/scylladb#19147 * github.com:scylladb/scylladb: test/alternator: utility functions for changing configuration alternator: add optional support for writing to system table test/alternator: reduce duplicated code	2025-08-08 09:13:15 +03:00
Raphael S. Carvalho	beaaf00fac	test: Add test that compaction doesn't cross logical group boundary Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:01 +03:00
Raphael S. Carvalho	d351b0726b	replica: Introduce views in compaction_group for incremental repair Wired the unrepaired, repairing and repaired views into compaction_group. Also the repaired filter was wired, so tablet_storage_group_manager can implement the procedure to classify the sstable. Based on this classifier, we can decide which view a sstable belongs to, at any given point in time. Additionally, we made changes to compaction_group_view to return only sstables that belong to the underlying view. From this point on, repaired, repairing and unrepaired sets are connected to compaction manager through their views. And that guarantees sstables on different groups cannot be compacted together. Repairing view specifically has compaction disabled on it altogether; we can revert this later if we want, to allow repairing sstables to be compacted with one another. The benefit of this logical approach is having the classifier as the single source of truth. Otherwise, we'd need to keep the sstable location consistent with global metadata, creating complexity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	61cb02f580	compaction: Allow view to be added with compaction disabled Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	9d3755f276	replica: Futurize retrieval of sstable sets in compaction_group_view This will allow upcoming work to gently produce a sstable set for each compaction group view. Example: repaired and unrepaired. Locking strategy for compaction's sstable selection: Since sstable retrieval path became futurized, tasks in compaction manager will now hold the write lock (compaction_state::lock) when retrieving the sstable list, feeding them into compaction strategy, and finally registering selected sstables as compacting. The last step prevents another concurrent task from picking the same sstable. Previously, all those steps were atomic, but we have seen stall in that area in large installations, so futurization of that area would come sooner or later. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	20c3301a1a	treewide: Futurize estimation of pending compaction tasks This is to allow futurization of compaction_group_view method that retrieves sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	af3592c658	replica: Allow compaction_group to have more than one view In order to support incremental repair, we'll allow each replica::compaction_group to have two logical compaction groups (or logical sstable sets), one for repaired, another for unrepaired. That means we have to adapt a few places to work with compaction_group_view instead, such that no logical compaction group is missed when doing table or tablet wide operations. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	e78295bff1	Move backlog tracker to replica::compaction_group Since there will be only one physical sstable set, it makes sense to move backlog tracker to replica::compaction_group. With incremental repair, it still makes sense to compute backlog accounting both logical sets, since the compound backlog influences the overall read amplification, and the total backlog across repaired and unrepaired sets can help driving decisions like giving up on incremental repair when unrepaired set is almost as large as the repaired set, causing an amplification of 2. Also it's needed for correctness because a sstable can move quickly across the logical sets, and having one tracker for each logical set could cause the sstable to not be erased in the old set it belonged to; Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	2c4a9ba70c	treewide: Rename table_state to compaction_group_view Since table_state is a view to a compaction group, it makes sense to rename it as so. With upcoming incremental repair, each replica::compaction_group will be actually two compaction groups, so there will be two views for each replica::compaction_group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:28 +03:00
Asias He	acc367c522	tests: adjust for incremental repair The separation of sstables into the logical repaired and unrepaired virtual sets requires some adjustments for certain tests, in particular for those that look at number of compaction tasks or number of sstables. The following tests need adjustment: * test/cluster/tasks/test_tablet_tasks.py * test/boost/memtable_test.cc The adjustments are done in such a way that they accommodate both the case where there is separate repaired/unrepaired states and when there isn't.	2025-08-08 06:49:17 +03:00
Andrei Chekun	5c095558b1	test.py: add timeout option for the whole run Add possibility to limit the execution time for one test in pytest Add --session-timeout to limit execution of the test.py or/and pytest session Closes scylladb/scylladb#25185	2025-08-07 21:06:14 +03:00
Avi Kivity	2b8f5d128a	Merge 'GCP Key Provider: Fix authentication issues' from Nikos Dragazis * Fix discovery of application default credentials by using fully expanded pathnames (no tildes). * Fix grant type in token request with user credentials. Fixes #25345. Closes scylladb/scylladb#25351 * github.com:scylladb/scylladb: encryption: gcp: Fix the grant type for user credentials encryption: gcp: Expand tilde in pathnames for credentials file	2025-08-07 20:50:12 +03:00
Dani Tweig	0ade762654	Adding action call to update Jira issue status Add actions that will change the relevant Jira issue status based on the linked PR changes. Closes scylladb/scylladb#25397	2025-08-07 15:55:58 +03:00
Benny Halevy	3f44dba014	sstables: make_entry_descriptor: make regex non-greedy With greedy matching, an sstable path in a snapshot directory with a tag that resembles a name-<uuid> would match the dir regular expression as the longest match, while a non-greedy regular expression would correctly match the real keyspace and table as the shortest match. Also, add a regression unit test reproducing the issue and validating the fix. Fixes #25242 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25323	2025-08-07 15:35:11 +03:00
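The greedy versus non-greedy behaviour described in `3f44dba014` can be sketched in Python as follows. This is only an illustration: the pattern, the directory layout and the names `greedy`, `lazy` and `parse` are assumptions made for the example, not the regular expression used by make_entry_descriptor.

```python
import re

# Assumed layout for illustration: <data dir>/<keyspace>/<table>-<32 hex digits>/...
UUID = r"[0-9a-f]{32}"
TAIL = rf"/(?P<ks>[^/]+)/(?P<table>[^/]+)-(?P<uuid>{UUID})(?:/.*)?$"

greedy = re.compile(r"^.*" + TAIL)   # the prefix consumes as much as possible
lazy = re.compile(r"^.*?" + TAIL)    # the prefix consumes as little as possible


def parse(pattern, path):
    """Return (keyspace, table) for the path, or None if it does not match."""
    m = pattern.match(path)
    return (m.group("ks"), m.group("table")) if m else None


# A snapshot whose tag happens to look like "<name>-<uuid>".
table_uuid = "a" * 32
tag_uuid = "b" * 32
path = f"/data/ks1/tbl1-{table_uuid}/snapshots/backup-{tag_uuid}/me-1-big-Data.db"

print(parse(greedy, path))  # picks up the snapshot tag as the table directory
print(parse(lazy, path))    # picks up the real keyspace and table
```

With the greedy prefix the match lands on the snapshot tag, so the keyspace is reported as `snapshots`; the non-greedy prefix stops at the first directory that fits, which is the real table directory.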
Avi Kivity	8164f72f6e	Merge 'Separate local_effective_replication_map from vnode_effective_replication_map' from Benny Halevy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Refs #22733 * No backport required Closes scylladb/scylladb#25222 * github.com:scylladb/scylladb: locator: abstract_replication_strategy: implement local_replication_strategy locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently locator: abstract_replication_map: rename make_effective_replication_map locator: abstract_replication_map: rename calculate_effective_replication_map replica: database: keyspace: rename {create,update}_effective_replication_map locator: effective_replication_map_factory: rename create_effective_replication_map locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al locator: abstract_replication_strategy: rename global_vnode_effective_replication_map keyspace: rename get_vnode_effective_replication_map dht: range_streamer: use naked e_r_m pointers storage_service: use naked e_r_m pointers alternator: ttl: use naked e_r_m pointers locator: abstract_replication_strategy: define is_local	2025-08-07 12:51:43 +03:00
Nadav Har'El	6f415b2f10	Merge 'test/cqlpy: Adjust test_describe.py to work against Cassandra' from Dawid Mędrek We adjust most of the tests in `cqlpy/test_describe.py` so that they work against both Scylla and Cassandra. This PR doesn't cover all of them, just those I authored. Refs scylladb/scylladb#11690 Backport: not needed. This is effectively a code cleanup. Closes scylladb/scylladb#25060 * github.com:scylladb/scylladb: test/cqlpy/test_describe.py: Adjust test_create_role_with_hashed_password_authorization to work with Cassandra test/cqlpy/test_describe.py: Adjust test_desc_restore to work with Cassandra test/cqlpy/test_describe.py: Mark Scylla-only tests as such	2025-08-07 12:43:04 +03:00
Patryk Jędrzejczak	161521b674	test: properly unset recovery_leader in the recovery procedure tests After changing the type of the `recovery_leader` config option from `sstring` to `UUID` in #25032, setting `recovery_leader` to an empty string became an incorrect way to unset it. The following error started to appear in the recovery procedure tests: ``` init - marshaling error: UUID string size mismatch: '' : recovery_leader ``` We fix it in this commit by removing `recovery_leader` from the config file.	2025-08-07 11:20:00 +02:00
Patryk Jędrzejczak	31372843e4	test: manager_client: allow removing a config option Currently, there is no simple way to remove an option from the server's config file in tests. One example when this is needed is removing the `recovery_leader` option on all servers during the recovery procedure. In this commit, we add a new method to `ManagerClient` that removes an option from the given server's config file.	2025-08-07 11:20:00 +02:00
Patryk Jędrzejczak	ce26896704	test: manager_client: add docstring to server_update_config	2025-08-07 11:19:54 +02:00
Avi Kivity	90eb6e6241	Merge 'sstables/trie: implement BTI node format serialization and traversal' from Michał Chojnowski This is the next part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25154 Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here. The new code added here is not used for anything yet, but it's posted as a separate PR to keep things reviewably small. This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes. It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR) into the on-disk format, and the logic for traversing the on-disk nodes during a read. New functionality, no backporting needed. Closes scylladb/scylladb#25317 * github.com:scylladb/scylladb: sstables/trie: add tests for BTI node serialization and traversal sstables/trie: implement BTI node traversal sstables/trie: implement BTI serialization utils/cached_file: add get_shared_page() utils/cached_file: replace a std::pair with a named struct	2025-08-07 12:15:42 +03:00
Amnon Heiman	2c02cb394b	executor.cc: get_previous_item with consistency level This patch extends get_previous_item so it can be used to calculate the size of a previous item. This will allow batch_get_item to obtain the size of a previous item without needing the item itself. The patch includes the following changes: * Removes the unneeded forward declaration of get_previous_item. * Extends get_previous_item to accept an explicit consistency level. * Modifies the regular get_previous_item to maintain the same functionality while calling the base implementation. * Adds a get_previous_item_size function that uses the base implementation to retrieve the size of a previous item when only the size is needed. For performance reasons, get_previous_item_size uses consistency level one. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-08-07 12:14:43 +03:00
Benny Halevy	02b922ac40	test: cql_query_test: add test_sstable_load_mixed_generation_type Test that we can load sstables with mixed, numerical and uuid generation types, and verify the expected data. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-07 12:04:23 +03:00
Benny Halevy	9b65856a26	test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils It's a generic helper that can be used by all tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-07 12:04:23 +03:00
Benny Halevy	7c9ce235d7	test: database_test: move table_dir helper to test/lib/test_utils It's a generic helper that can be used by all tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-07 12:04:23 +03:00
Nadav Har'El	d632599a92	Merge 'test.py: native pytest repeats' from Andrei Chekun The previous way of executing repeats was to launch pytest for each repeat. That was resource consuming, since each time pytest was doing discovery of the tests. Now all repeats are done inside one pytest process. Backport for 2025.3 is needed, since this functionality is framework only, and 2025.3 is affected by these slow repeats as well. Closes scylladb/scylladb#25073 * github.com:scylladb/scylladb: test.py: add repeats in pytest test.py: add directories and filename to the log files test.py: rename log sink file for boost tests test.py: better error handling in boost facade	2025-08-06 18:18:03 +03:00
Dawid Pawlik	b284961a95	scripts: fetch the name of the author of the PR The `pull_github_pr.sh` script has been fetching the username from the owner of the source branch. The owner of the branch is not always the author of the PR. For example the branch might come from a fork managed by organization or group of people. This led to having the author in merge commits referred to as `null` (if the name was not set for the group) or it mentioned a name not belonging to the author of the patch. Instead of looking for the owner of the source branch, the script should look for the name of the PR's author. Closes scylladb/scylladb#25363	2025-08-06 16:45:38 +03:00
Benny Halevy	5e5e63af10	scylla-sstable: print_query_results_json: continue loop if row is disengaged Otherwise it is accessed right when exiting the if block. Add a unit test reproducing the issue and validating the fix. Fixes #25325 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25326	2025-08-06 16:44:51 +03:00
Szymon Malewski	eb11485969	test/alternator: enable more relevant logs in CI. This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%. This affects running alternator tests only with `test.py`, not with `test/alternator/run`. Closes #24645 Closes scylladb/scylladb#25327	2025-08-06 16:37:25 +03:00
Nikos Dragazis	ee92fcc078	encryption_at_rest_test: Preserve tmpdir from failing KMIP tests The KMIP tests start a local PyKMIP server and configure it to write logs in the test's temporary directory (`tmpdir`). However, the tmpdir is a RAII object that deletes the directory once it goes out of scope, causing PyKMIP server logs to be lost on test failures. To assist with debugging, preserve the whole directory if the test failed with an exception. Allow the user to disable this by setting the SCYLLA_TEST_PRESERVE_TMP_ON_EXCEPTION environment variable. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-06 16:29:19 +03:00
Benny Halevy	6dbbb80aae	locator: abstract_replication_strategy: implement local_replication_strategy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Note that everywhere_replication_strategy is not abstracted in a similar way, although it could, since the plan is to get rid of it once all system keyspaces are converted to local or tablets replication (and propagated everywhere if needed using raft group0) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:05:11 +03:00
Benny Halevy	8bde507232	locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently create_effective_replication_map need not know about the internals of vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	8d4ac97435	locator: abstract_replication_map: rename make_effective_replication_map to make_vnode_effective_replication_map_ptr since it is specific to vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	babb4a41a8	locator: abstract_replication_map: rename calculate_effective_replication_map to calculate_vnode_effective_replication_map since it is specific to vnode-based range calculations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	34b223f6f9	replica: database: keyspace: rename {create,update}_effective_replication_map to *_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	688bd4fd43	locator: effective_replication_map_factory: rename create_effective_replication_map to create_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	cbad497859	locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al to static_effective_replication_map_ptr, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	2ab44e871b	locator: abstract_replication_strategy: rename global_vnode_effective_replication_map to global_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:49 +03:00
Benny Halevy	bd62421c05	keyspace: rename get_vnode_effective_replication_map to get_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map (both are per-keyspace). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:40:43 +03:00
Benny Halevy	33f34c8c32	dht: range_streamer: use naked e_r_m pointers Prepare for following patch that will separate the local effective replication map from vnode_effective_replication_map. The caller is responsible to keep the effective_replication_map_ptr alive while in use by low-level async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Benny Halevy	d6d434b1c2	storage_service: use naked e_r_m pointers Prepare for following patch that will separate the local effective replication map from vnode_effective_replication_map. The caller is responsible to keep the effective_replication_map_ptr alive while in use by low-level async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Benny Halevy	59375e4751	alternator: ttl: use naked e_r_m pointers Prepare for following patch that will separate the local effective replication map from vnode_effective_replication_map. The caller is responsible to keep the effective_replication_map_ptr alive while in use by low-level async functions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Benny Halevy	ec85678de1	locator: abstract_replication_strategy: define is_local Prepare for specializing the local replication strategy, local effective replication map, et al. by defining an is_local() predicate, similar to uses_tablets(). Note that is_vnode_based() still applies to local replication strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Nikos Dragazis	1eb99fb5f5	test/lib: Add option to preserve tmpdir on exception Extend the tmpdir class with an option to preserve the directory if the destructor is called during stack unwinding (i.e., uncaught exception). To be used in tests where the tmpdir contains non-temporary resources that may help in diagnosing test failures (e.g., logs from external services such as PyKMIP). This will be used in the next patch. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-06 13:07:52 +03:00
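The preserve-on-exception behaviour added in `1eb99fb5f5` lives in a C++ RAII class. The sketch below shows the same idea as a Python context manager, which is a swapped-in technique for illustration; the class name `PreservingTmpDir` and its parameter are hypothetical and do not come from test/lib.

```python
import os
import shutil
import tempfile


class PreservingTmpDir:
    """Temporary directory that is kept on disk if the block raises."""

    def __init__(self, preserve_on_exception=True):
        self.preserve_on_exception = preserve_on_exception
        self.path = None

    def __enter__(self):
        self.path = tempfile.mkdtemp()
        return self

    def __exit__(self, exc_type, exc, tb):
        # exc_type is set only when the block is being left by an exception,
        # which plays the role of "destructor called during stack unwinding".
        if exc_type is None or not self.preserve_on_exception:
            shutil.rmtree(self.path)
        return False  # never swallow the exception


def run(fail):
    """Create a directory, optionally fail inside the block, return its path."""
    kept = None
    try:
        with PreservingTmpDir() as d:
            kept = d.path
            if fail:
                raise RuntimeError("simulated test failure")
    except RuntimeError:
        pass
    return kept
```

A failing block leaves the directory (and any logs written into it) in place for inspection, while a passing block cleans up as usual.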
Pavel Emelyanov	0616407be5	Merge 'rest_api: add endpoint which drops all quarantined sstables' from Taras Veretilnyk Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061 Backport is not required, it is new functionality Closes scylladb/scylladb#25063 * github.com:scylladb/scylladb: docs: Add documentation for the nodetool dropquarantinedsstables command nodetool: add command for dropping quarantine sstables rest_api: add endpoint which drops all quarantined sstables	2025-08-06 11:55:15 +03:00
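A request to the endpoint described in `0616407be5` could be built as in the sketch below. The path and the `keyspace` and `tables` query parameters come from the commit message; the base URL, the helper name `drop_quarantined_request` and the example keyspace and table names are assumptions, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def drop_quarantined_request(base_url, keyspace=None, tables=None):
    """Build (but do not send) a POST request for dropping quarantined sstables."""
    params = {}
    if keyspace is not None:
        params["keyspace"] = keyspace
    if tables:
        # The commit message describes `tables` as comma-separated table names.
        params["tables"] = ",".join(tables)
    url = f"{base_url}/storage_service/drop_quarantined_sstables"
    if params:
        url += "?" + urlencode(params)
    return Request(url, method="POST")


# Assumed base URL of a node's REST API; adjust to the actual deployment.
req = drop_quarantined_request("http://127.0.0.1:10000", "ks1", ["t1", "t2"])
print(req.get_method(), req.full_url)
```

Omitting both parameters yields the global form of the call, which per the commit message drops all quarantined SSTables.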
Nadav Har'El	10588958e0	test/alternator: add regression test for keep-alive support An Alternator user complained about suspiciously many new connections being opened, which raised a suspicion that maybe Alternator doesn't support HTTP and HTTPS keep-alive (allowing a client to reuse the same connection for multiple requests). It turns out that we never had a regression test that this feature actually works (and doesn't break), so this patch adds one. The test confirms that Alternator's connection reuse (keep-alive) feature actually works correctly. Of course, only if the driver really tries to reuse a connection - which is a separate question and needs testing on the driver side (scylladb/alternator-load-balancing#82). The test sends two requests using Python's "requests" library which can normally reuse connections (it uses a "connection pool"), and checks if the connection was really reused. Unfortunately "requests" doesn't give us direct knowledge of whether or not it reused a connection, so we check this using simple monkey-patching. I actually tried multiple other approaches before settling on this one. The approach needs to work on both HTTP and HTTPS, and also on AWS DynamoDB. Importantly, the test checks both keep-alive and non-keep-alive cases. This is very important for validating the test itself and its tricky monkey-patching code: The test is meant to detect when the socket is not reused for the second request, so we want to also check the non-keep-alive case where we know the socket isn't reused, to see the test code really detected this situation. By default, this test runs (like all of Alternator's test suite) on HTTP sockets. Running this test with "test/alternator/run --https" will run it on HTTPS sockets. The test currently passes on both HTTP and HTTPS. It also passes on AWS DynamoDB ("test/alternator/run --aws") Fixes #23067 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25202	2025-08-06 11:41:21 +03:00
Avi Kivity	630b3d31bb	storage_proxy: reduce allocations in send_to_live_endpoints() send_to_live_endpoints() computes sets of endpoints to which we send mutations - remote endpoints (where we send to each set as a whole, using forwarding), and local endpoints, where we send directly. To make handling regular, each local endpoint is treated as its own set. Thus, each local endpoint and each datacenter receive one RPC call (or local call if the coordinator is also a replica). These sets are maintained in a std::unordered_map (for remote endpoints) and a vector with the same value_type as the map (for local endpoints). The key part of the vector payload is initialized to the empty string. We simplify this by noting that the datacenter name is never used after this computation, so the vector can hold just the replica sets, without the fake datacenter name. The downstream variable `all` is adjusted to point just to the replica set as well. As a reward for our efforts, the vector's contents become nothrow move constructible (no string), and we can convert it to a small_vector, which reduces allocations in the common case of RF<=3. The reduction in allocations is visible in perf-simple-query --write results: ``` before 165080.62 tps ( 60.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 53438 insns/op, 26705 cycles/op, 0 errors) after 164513.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 53347 insns/op, 26761 cycles/op, 0 errors) ``` The instruction count reduction is a not very impressive 70/op: before ``` instructions_per_op: mean= 53412.22 standard-deviation=32.12 median= 53420.53 median-absolute-deviation=20.32 maximum=53462.23 minimum=53290.06 ``` after ``` instructions_per_op: mean= 53350.32 standard-deviation=32.38 median= 53353.71 median-absolute-deviation=13.60 maximum=53415.20 minimum=53222.24 ``` Perhaps the extra code from small_vector defeated some inlining, which negated some of the gain from the reduced allocations. 
Perhaps a build with full profiling will gain it back (my builds were without pgo). Closes scylladb/scylladb#25270	2025-08-06 11:28:20 +03:00
Karol Nowacki	032e8f9030	test/boost/vector_store_client_test.cc: Fix flaky tests The vector_store_client_test was observed to be flaky, sometimes hanging while waiting for a response from HTTP server. Problem: The default load balancing algorithm (in Seastar's posix_server_socket_impl::accept) could route an incoming connection to a different shard than the one executing the test. Because the HTTP server is a non-sharded service running only on the test's originating shard, any connection submitted to another shard would never be handled, causing the test client to hang waiting for response. Solution: The patch resolves the issue by explicitly setting fixed cpu load balancing algorithm. This ensures that incoming connections are always handled on the same shard where the HTTP server is running. Closes scylladb/scylladb#25314	2025-08-06 11:24:51 +03:00
Taras Veretilnyk	bcb90c42e4	docs: Sort commands list in nodetool.rst Fixes scylladb/scylladb#25330 Closes scylladb/scylladb#25331	2025-08-06 11:20:53 +03:00
Nikos Dragazis	b1d5a67018	encryption: gcp: Fix the grant type for user credentials Exchanging a refresh token for an access token requires the "refresh_token" grant type [1]. [1] https://datatracker.ietf.org/doc/html/rfc6749#section-6 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-06 10:39:17 +03:00
Nadav Har'El	fa86405b1f	test/alternator: utility functions for changing configuration Now that the previous patch made it possible to write to system tables in Alternator tests, this patch introduces utility functions for changing the configuration - scylla_config_write() in addition to the scylla_config_read() we already had, and scylla_config_temporary() to temporarily change a configurable parameter and then restore it to its old value. This patch adds a silly test that temporarily modifies the query_tombstone_page_limit configuration parameter. Later we can add more tests that use the new test functions for more "serious" testing of real features. In particular, we don't have an Alternator test for the max_concurrent_requests_per_shard configuration - and I want to write one. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 10:02:24 +03:00
Nadav Har'El	a896e2dbb9	alternator: add optional support for writing to system table In commit `44a1daf` we added the ability to read system tables through the DynamoDB API (actually, the Scan and Query requests only). This ability is useful for tests, and can also be useful to users who want to read information that is only available through system tables. This patch adds support also for writing into system tables. This will be useful for Alternator tests, where we want to temporarily change some live-updatable configuration option - and so far haven't been able to do that as we did in some cql-pytest tests. For reasons explained in issue #23218, only superuser roles are allowed to write to system tables - it is not enough for the role to be granted MODIFY permissions on the system table or on ALL KEYSPACES. Moreover, the ability to modify system tables carries special risks, so this patch only allows writes to the system tables if a new configuration option "alternator_allow_system_table_write" is turned on. This option is turned off by default. This patch also includes a test for this new configuration-writing capability. The test scripts test/alternator/run and test.py now run Scylla with alternator_allow_system_table_write turned on, but the new test can also run without this option, and will be skipped in that case (to allow running the test suite against some manually-run instance of Scylla). Fixes: #12348 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 10:00:04 +03:00
Nadav Har'El	5913498fff	test/alternator: reduce duplicated code Four tests had almost identical code to read an item from Scylla configuration (using the system.config system table). It's time to make this into a new utility function, scylla_config_read(). This is a good time to do it, because in a later patch I want to also add a similar function to write into the configuration. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-06 09:56:47 +03:00
Nadav Har'El	d46dda0840	Merge 'cql, vector_search: implement read path' from null This pull request is an addition of ANN OF queries. The patch contains: - CQL syntax for ORDER BY `vector_column_name` ANN OF `vector_literal` clause of SELECT statements. - implementation of external ANN queries (using vector-store service) - tests Example syntax: ``` SELECT comment FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 3; ``` Limit can be between 1 and 1000 - same as for Cassandra. Co-authored-by: @janpiotrlakomy @smoczy123 Fixes: VECTOR-48 Fixes: VECTOR-46 Closes scylladb/scylladb#24444 * github.com:scylladb/scylladb: cql3/statements: implement external `ANN OF` queries vector_store_client: implement ann_error_visitor test/cqlpy: check ANN queries disallow filtering properly cassandra_tests: translate vector_invalid_query_test cassandra_tests: copy vector_invalid_query_test from Cassandra vector_index: make parameter names case insensitive cql3/statements: add `ANN OF` queries support to select statements cql/Cql.g: extend the grammar to allow for `ANN OF` queries cql3/raw: add ANN ordering to the raw statement layer	2025-08-06 09:53:38 +03:00
Nikos Dragazis	77cc6a7bad	encryption: gcp: Expand tilde in pathnames for credentials file The GCP host searches for application default credentials in known locations within the user's home directory using `seastar::file_exists()`. However, this function does not perform tilde expansion in pathnames. Replace tildes with the home directory from the HOME environment variable. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-06 09:46:08 +03:00
Avi Kivity	bb922b2aa9	Merge 'truncate: change check for write during truncate into a log warning' from Ferenc Szili TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names and the offending replay positions which caused the check to fail. This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that: - all data written before TRUNCATE starts is deleted - none of the data after TRUNCATE completes is deleted Fixes: #25173 Fixes: #25013 Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1 Closes scylladb/scylladb#25174 * github.com:scylladb/scylladb: truncate: add test for truncate with concurrent writes truncate: change check for write during truncate into a log warning	2025-08-06 00:03:37 +03:00
Michał Chojnowski	9930cd59eb	sstables/trie: add tests for BTI node serialization and traversal Adds tests which check that nodes serialized by `bti_node_sink` are readable by `bti_node_reader` with the right result. (Note: there are no tests which check compatibility of the encoded nodes with Cassandra or with handwritten hexdumps. There are only tests for mutual compatibility between Scylla's writers and readers. This can be considered a gap in testing.)	2025-08-05 21:48:24 +02:00
Pavel Emelyanov	10056a8c6d	Merge 'Simplify credential reload: remove internal expiration checks' from Ernest Zaslavsky This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm. To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does. Fixes: https://github.com/scylladb/scylladb/issues/25044 Should be backported to 2025.3 since we need this fix for the restore Closes scylladb/scylladb#24961 * github.com:scylladb/scylladb: s3_creds: code cleanup s3_creds: Make `reload` unconditional s3_creds: Add test exposing credentials renewal issue	2025-08-05 17:49:13 +03:00
Michael Litvak	faebfdf006	test/cluster/test_tablets_colocation: fix flaky test When restarting the server in the test, wait for it to become ready before requesting tablet repair. Fixes scylladb/scylladb#25261 Closes scylladb/scylladb#25263	2025-08-05 15:36:03 +02:00
Avi Kivity	4c785b31c7	Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series adds also Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using -without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field. Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator connections, we will list the currently active requests. The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing check of CQL's system.clients. Fixes #24993 This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. 
There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug on it can cause problems for every Alternator request. Closes scylladb/scylladb#25178 * github.com:scylladb/scylladb: test/cqlpy: slightly strengthen test for system.clients generic_server: use utils::scoped_item_list docs/alternator: document the system.clients system table in Alternator alternator: add test for Alternator clients in system.clients alternator: list active Alternator requests in system.clients utils: unit test for utils::scoped_item_list utils: add a scoped_item_list utility class utils: add "fatal" version of utils::on_internal_error()	2025-08-05 15:55:41 +03:00
Ferenc Szili	33488ba943	truncate: add test for truncate with concurrent writes test_validate_truncate_with_concurrent_writes checks if truncate deletes all the data written before the truncate starts, and does not delete any data after truncate completes.	2025-08-05 13:54:14 +02:00
Jan Łakomy	447c66f4ec	cql3/statements: implement external `ANN OF` queries Implement execution of `ANN OF` queries using the vector_store service. Throw invalid_request_exception with specific message using the ann_error_visitor when ANN request returns no result. Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com> Co-authored-by: Michał Hudobski <michal.hudobski@scylladb.com>	2025-08-05 12:34:48 +02:00
Dawid Pawlik	7a826b79d9	vector_store_client: implement ann_error_visitor Implement ann_error_visitor managing error messages depending on ANN error type received.	2025-08-05 12:34:48 +02:00
Dawid Pawlik	74f603fe99	test/cqlpy: check ANN queries disallow filtering properly Add tests checking if filtering with clustering column or using index is disallowed while performing ANN query.	2025-08-05 12:34:48 +02:00
Amnon Heiman	5abbfa1e52	executor: Extend API of put_or_delete_item This patch adds two methods to put_or_delete_item that will be used in WCU batch calculation: A setter to set the item length. A boolean getter that determines if this is a delete action.	2025-08-05 10:23:36 +03:00
Pavel Emelyanov	5fcdf948d9	doc: Update system.clients schema with scheduling_group cell It was added by `9319d65971` (db/virtual_tables: add scheduling group column to system.clients) recently. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25294	2025-08-05 10:16:20 +03:00
Amnon Heiman	f75aa1b35e	alternator/executor.cc: Accurate WCU for put, update, delete To improve the accuracy of Write Capacity Unit (WCU) calculation, this patch introduces the use of the `alternator_force_read_before_write` configuration option. When enabled, it forces a read-before-write operation on `PutItem`, `UpdateItem`, and `DeleteItem` requests. This comes with performance overhead and should be used with caution, especially in high-throughput environments.	2025-08-05 10:01:32 +03:00
Amnon Heiman	94a3556be5	config: add alternator_force_read_before_write This patch introduces a new configuration parameter, `alternator_force_read_before_write`, which forces Alternator to perform a read-before-write on all write operations (`PutItem`, `UpdateItem`, and `DeleteItem`), even when not strictly required. Enabling this option ensures better DynamoDB compatibility in WCU calculation by accounting for the size of the existing item, as done in DynamoDB. This option introduces performance overhead and should be used with care. The parameter is runtime-configurable and can be toggled via CQL: UPDATE system.config SET value = 'true' WHERE name = 'alternator_force_read_before_write';	2025-08-05 08:31:00 +03:00
Michał Chojnowski	85964094f6	sstables/trie: implement BTI node traversal This commit implements routines for traversal of BTI nodes in their on-disk format. The `node_reader` concept is currently unused (i.e. not asserted by any template). It will only be used in the next PR, which will implement trie cursor routines parametrized `node_reader`. But I'm including it in this PR to make it clear which functions will be needed by the higher layer.	2025-08-05 00:56:50 +02:00
Michał Chojnowski	302adfb50d	sstables/trie: implement BTI serialization This commit introduces code responsible for serializing trie nodes (`writer_node`) into the on-disk BTI format, as described in: `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md`	2025-08-05 00:56:50 +02:00
Michał Chojnowski	6fe7dbaedc	utils/cached_file: add get_shared_page() BTI index is page-aware. It's designed to be read in page units. Thus, we want a `cached_file` accessor which explicitly requests a whole page, preferably without copying it. `cached_file` already works in terms of reference-counted pages, underneath. This commit only adds some accessors which lets us request those reference-counting page pointers more directly.	2025-08-05 00:56:50 +02:00
Michał Chojnowski	58d768e383	utils/cached_file: replace a std::pair with a named struct Cosmetic change. For clarity.	2025-08-05 00:55:32 +02:00
Artsiom Mishuta	4b975668f6	tiering (test.py): introduce tiering labels Introduce tiering marks: 1. “unstable” - for unstable tests that will continue running every night and generate up-to-date failure statistics without failing the “Main” verification path (scylla-ci, Next). 2. “nightly” - for tests that are quite old and stable, test functionality that is not expected to be changed or affected by other features, are partially covered by other tests, verify non-critical functionality, have not found any issues or regressions, are too long to run on every PR, and can be dropped from the CI run. Set 7 long tests (according to statistics in Elastic) as nightly (these 8 tests took 20% of the CI run, about 4 hours without parallelization) and 1 test as unstable (as an example of marker usage). Closes scylladb/scylladb#24974	2025-08-04 15:38:16 +03:00
Ferenc Szili	268ec72dc9	truncate: change check for write during truncate into a log warning TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names, the truncated_at timepoint, and the offending replay positions which caused the check to fail. Fixes: #25173 Fixes: #25013	2025-08-04 12:24:50 +02:00
Piotr Dulikowski	ec7832cc84	Merge 'Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Patryk Jędrzejczak The following steps are performed in sequence as part of the Raft-based recovery procedure: - set `recovery_leader` to the host ID of the recovery leader in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - perform a rolling restart (with the recovery leader being restarted first). These steps are not intuitive and more complicated than they could be. In this PR, we simplify these steps. From now on, we will be able to simply set `recovery_leader` on each node just before restarting it. Apart from making necessary changes in the code, we also update all tests of the Raft-based recovery procedure and the user-facing documentation. Fixes scylladb/scylladb#25015 The Raft-based procedure was added in 2025.2. This PR makes the procedure simpler and less error-prone, so it should be backported to 2025.2 and 2025.3. Closes scylladb/scylladb#25032 * github.com:scylladb/scylladb: docs: document the option to set recovery_leader later test: delay setting recovery_leader in the recovery procedure tests gossip: add recovery_leader to gossip_digest_syn db: system_keyspace: peers_table_read_fixup: remove rows with null host_id db/config, gms/gossiper: change recovery_leader to UUID db/config, utils: allow using UUID as a config option	2025-08-04 08:29:32 +02:00
Ernest Zaslavsky	837475ec6f	s3_creds: code cleanup Remove unnecessary code that is no longer used	2025-08-04 09:26:11 +03:00
Ernest Zaslavsky	e4ebe6a309	s3_creds: Make `reload` unconditional Assume that any caller invoking `reload` intends to refresh credentials. Remove conditional logic that checks for expiration before reloading.	2025-08-03 17:41:35 +03:00
Ernest Zaslavsky	68855c90ca	s3_creds: Add test exposing credentials renewal issue Add a test demonstrating that renewing credentials does not update their expiration. After requesting credentials again, the expiration remains unchanged, indicating no actual update occurred.	2025-08-03 17:41:25 +03:00
Avi Kivity	1c25aa891b	Merge 'storage_proxy.cc: get_cas_shard: fallback to the primary replica shard' from Petr Gusev Currently, `get_cas_shard` uses `sharder.shard_for_reads` to decide which shard to use for LWT execution, both on replicas and the coordinator. If the coordinator is not a replica, `shard_for_reads` returns a default shard (shard 0). There are at least two problems with this: * shard 0 can become overloaded, because all LWT coordinators-but-not-replicas are served on it. * mismatch with replicas: the default shard doesn't match what `shard_for_reads` returns on replicas. This hinders the "same shard for client and server" RPC level optimization. In this PR we change `get_cas_shard` to use a primary replica shard if the current node is not a replica. This guarantees that all LWT coordinators for the same tablet will be served on the same shard. This is important for LWT coordinator locks (`paxos::paxos_state::get_cas_lock`). Also, if all tablet replicas on different nodes live on the same shard, RPC optimization will make sure that no additional `smp::submit_to` will be needed on server side. backport: not needed, since this fix applies only to LWT over tablets, and this feature is not released yet Closes scylladb/scylladb#25224 * github.com:scylladb/scylladb: test_tablets_lwt.py: make tests rf_rack_valid test_tablets_lwt: add test_lwt_coordinator_shard storage_proxy.cc: get_cas_shard: fallback to the primary replica shard sharder: add try_get_shard_for_reads method	2025-08-01 23:07:25 +03:00
Avi Kivity	8b1bf46086	Merge 'sstables: introduce trie_writer' from Michał Chojnowski This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduced separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization). Refs scylladb/scylladb#19191 New functionality, no backporting needed. Closes scylladb/scylladb#25154 * github.com:scylladb/scylladb: sstables: introduce trie_writer utils/bit_cast: add object_representation()	2025-08-01 20:23:24 +03:00
Nikos Dragazis	b186c48a65	encryption-at-rest.rst: add "Rotate Encryption Keys" section Add a new section for key rotation, offering separate instructions per key provider, organized in tabs. The gist: * Local Key Provider - Rotation requires creating a new key file per node. It's a manual procedure. * Replicated Key Provider - Rotation is not supported. * KMIP Key Provider - Rotation is transparent to Scylla, but it requires manually revoking the key in the server. * {KMS,GCP} Key Provider - Rotation is transparent to Scylla and can be automated in the server. * Azure Key Provider - Rotation is automatically supported by Scylla by keeping track of the key version along with the encrypted data. The rotation needs to be done at the Key Vault server, and can be automated. Explain that, even after rotation, old keys may be still in use due to caching, and that old SSTables will remain encrypted with the old key until the next compaction. Provide instructions in case they prefer not to wait. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	3abacaa465	encryption-at-rest.rst: rewrite "Encrypt System Resources" section - Mention all types of system data that fall under system encryption. - Add "Before you Begin" section with requirements per key provider. The requirements are the same as in user encryption. - Mention explicitly that the Replicated Key Provider cannot be used for system encryption. - Provide separate instructions for each key provider. Explain all the configuration options. - Provide an extra example for the Local Key Provider with a ``system_key_directory`` and ``key_name``. - Highlight the code blocks as YAML. Make their indentation consistent with the rest of the doc (2 spaces). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	c59f71b399	encryption-at-rest.rst: rewrite "Update Encryption Properties of Existing Tables" section - Split the various scenarios into sub-sections, not just examples. - Amend the example for changing cipher algorithm and key length. The algorithm used in the example was the same. - Point out that disabling encryption through the table schema is not possible if a node has default encryption configured. - Amend the `nodetool upgradesstables` command. The `--include-all-sstables` is necessary. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	22f941b325	encryption-at-rest.rst: rewrite "Encrypt a Single Table" section - Add a short intro. - Add an early note about the fact that options from ``scylla_encryption_options`` cannot be mixed with options from ``user_info_encryption``. - Add a new "Allow Per-Table Encryption" subsection to document the ``allow_per_table_encryption`` option. - Move the top-level procedure into a new "Encrypt a New Table" subsection to differentiate it from the "Update Encryption Properties of Existing Tables". - Add tabs for provider-dependent steps in "Before you Begin" and "Procedure". - Amend "bytes" to "bits" (for the key length). - Add examples for the replicated, KMIP, GCP, and Azure key providers. Use consistent keyspace and table names in all examples. - Remove step for upgrading SSTables. The table is new - no SSTables exist yet. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	bd83f3e672	encryption-at-rest.rst: rewrite "Encrypt Tables" section - Provide separate requirements and instructions for each key provider, organized in tabs. - Mention explicitly that the Replicated Key Provider cannot be used for default encryption. - Fix indentation for code blocks in examples (2 spaces). - For KMS, GCP, and Azure, add the `master_key` option in the list of options and remove the relevant example (not so common). - Add steps for rolling restart. - Amend "bytes" to "bits" (for the key length). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	fb030b11c3	encryption-at-rest.rst: update "Set the Azure Host" section - Mark the `master_key` as required. Technically, it's not, since it can be specified in the schema encryption options, but: - It's better to keep it simple. The common case is to have a default value that occasionally needs to be overridden. - No functionality is lost. - It is mentioned as required for AWS and GCP. - Add a note about credential resolution. - Make some minor formatting changes to be consistent with the AWS and GCP sections. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	e25b283c8d	encryption-at-rest.rst: update "Set the GCP Host" section - Add list of requirements (KMS Key, credentials, permissions). - Add a reference to "Create Encryption Keys" section. - Amend description for `master_key`. - Add one example per credential type. - Explain how credentials are resolved if not explicitly specified in the configuration. - Fix indentation of "restart" command. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	d9242ba47f	encryption-at-rest.rst: update "Set the KMS Host" section - Add a list of requirements (KMS key, credentials, permissions). - Add a reference to "Create Encryption Keys" section. - Add one example per credential type. - Explain how credentials are resolved from the environment, or the AWS credentials file. - Fix indentation of "restart" command. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	cf9301c573	encryption-at-rest.rst: update "Set the KMIP Host" section - Uncomment the code block to match the other hosts. - Remove the ``certficate_revocation_list`` option; it's not supported. - Amend the default values for ``key_cache_expiry`` and ``key_cache_refresh``. - Add an example with mutual TLS authentication. - Fix indentation of "restart" command. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	b777dd267d	encryption-at-rest.rst: rewrite "Create Encryption Keys" section - Provide separate instructions for each key provider, organized in tabs. Move the existing instructions with the key generator script under the "Local Key Provider" tab. Point to the cloud provider's documentation for AWS, GCP, and Azure keys. List the required attributes for KMIP keys. List the required keys for the Replicated Key Provider. - In the example for the key generator script, use the same algorithm and key strength for both the secret key and the system key, since this is the recommended case. - Reorder the usage list of arguments for the key generator script. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	60df275197	encryption-at-rest.rst: rewrite "Key Providers" section - Use monospace font for key provider factories. - Add a sub-section for every key provider. Explain how they operate at a high level and highlight any possible limitations. - Remove version availability notes. The version 2019.1.3 is old and unsupported. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	3c2f4ed1e7	encryption-at-rest.rst: hoist and update "Cipher Algorithm Descriptors" Turn an earlier reference to "algorithm descriptor" into a hyperlink. Use monospace font in the table header for "cipher_algorithm" and "secret_key_strength"; these are verbatim identifiers in "scylla.yaml" and "scylla_encryption_options". Same for their supported values. Restrict the Blowfish key size to 128 bits, due to <https://github.com/scylladb/scylla-enterprise/issues/4848>. Add notes on ECB vs. CBC, and on Blowfish's 64-bit block size. Emphasize our recommendation more. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	f07125cfea	encryption-at-rest.rst: rewrite/replace section "Encryption Key Types" - Referring to system info encryption vs. user info encryption as distinct "encryption key types" is confusing. The behavior of encryption is similar in both cases, only the sets of data that are subject to encryption differ. Rename the section to "Data Classes for Encryption". - Introduce the two highest-level "scylla.yaml" stanzas, "system_info_encryption" and "user_info_encryption". Subsequently, we'll expand on their (common!) contents later. - Remove the comment that, for the Local Key Provider, a keystore can be created either manually or automatically. This is stated / repeated elsewhere in the document. - Remove the unused anchor "_Replicated". - The notes on the Replicated Key Provider both lack nuance, and are ill-placed, here. Remove those notes. Add a dedicated description for Replicated later, elsewhere. Do mention "system_replicated_keys.encrypted_keys" here in passing, as a system table with sensitive contents. - The short listing of key providers is ill-placed here. We have an entire section dedicated to those. Furthermore, the various key providers apply to system info encryption, too. - Explain the two levels of configuration for SSTables of user tables. - Move the note about preserving keys for restoring backups to Key Providers \| About Local Key Storage, at least temporarily. When keys are stored on a key management server (KMIP, GCP, AWS, Azure), then backing those up is its own admin task / responsibility. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	268f5b1564	encryption-at-rest.rst: About: describe high-level operation more precisely Clarify some table vs. SSTable differences. Spell out the SSTable metadata ("Scylla.db") component. Spell out commit log metadata files. Explain that encryption settings are "snapshotted" into those meta-files. Highlight that encryption config may vary per table and per node. (For example, a local file key provider under the same pathname on each node, referenced by the table's "scylla_encryption_options" in the schema, may provide different keys for different nodes.) Introduce "algorithm descriptor" and "key provider" as generic concepts. Touch up the grammar / vocabulary slightly. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	8717102ae5	encryption-at-rest.rst: improve wording / formatting in About intro - Remove the KMIP password from the list of system level data. Encrypting this would require the `configuration_encryptor`, which has been removed as part of the effort to decommission all our java tools. - Provide an exhaustive list of system tables being encrypted. - "Table level granularity" is redundant; either "table level" or "table granularity" should suffice. Pick the latter. - Distinguish "block cipher" from "mode of operation" more precisely. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	b45d7417ef	encryption-at-rest.rst: users (plural) typo fix scylladb presumably stores data for multiple users. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	68dfa41e69	encryption-at-rest.rst: rewrap Wrap long lines at 80 chars. Seastar coding style suggests 160 chars, but 80 chars is more comfortable for side-by-side PR diffs on GitHub. Exclude arg lists and code blocks. Set the limit at 160 chars for arg lists to avoid too much wrapping that would hurt readability. Do not wrap code blocks at all. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	54ad1fe35f	encryption-at-rest.rst: strip trailing whitespace Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-08-01 17:27:45 +03:00
Andrei Chekun	c0d652a973	test.py: change boost test stdout to use filehandler instead of pipe With the current implementation, if pytest is killed, it cannot write the stdout from the boost test. With the new approach, the output is updated while the test is executing, instead of being written at the end of the test. Closes scylladb/scylladb#25260	2025-08-01 15:05:00 +03:00
Michał Jadwiszczak	10214e13bd	storage_service, group0_state_machine: move SL cache update from `topology_state_load()` to `load_snapshot()` Currently the service levels cache is unnecessarily updated in every call of `topology_state_load()`. But it is enough to reload it only when a snapshot is loaded. (The cache is also already updated when there is a change to one of `service_levels_v2`, `role_members`, `role_attributes` tables.) Fixes scylladb/scylladb#25114 Fixes scylladb/scylladb#23065 Closes scylladb/scylladb#25116	2025-08-01 13:41:08 +02:00
Jan Łakomy	8b2ed0f014	cassandra_tests: translate vector_invalid_query_test Translate vector_invalid_query_test which tests parsing of ANN OF syntax. Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>	2025-08-01 12:08:50 +02:00
Jan Łakomy	eec47d9059	cassandra_tests: copy vector_invalid_query_test from Cassandra Copy over this test's code from Cassandra and comment it out, so that it can be translated later.	2025-08-01 12:08:50 +02:00
Dawid Pawlik	b29e6870fa	vector_index: make parameter names case insensitive The custom index class name 'vector_index' and its similarity function options should be case insensitive. Before the patch, the similarity functions had to be written in SCREAMING_SNAKE_CASE, which is neither common nor intuitive. Furthermore, the tests translated from Cassandra use options written in snake_case, and since we wanted to translate them exactly, we had to be able to use lower-case options.	2025-08-01 12:08:50 +02:00
Jan Łakomy	5fecad0ec8	cql3/statements: add `ANN OF` queries support to select statements Add parsing of `ANN OF` queries to the `select_statement` and `indexed_table_select_statement` classes. Add a placeholder for the implementation of external ANN queries. Rename `should_create_view` to `view_should_exist` as it is used not only to check if the view should be created but also if the view has been created. Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>	2025-08-01 12:08:50 +02:00
Taras Veretilnyk	15e3980693	docs: Add documentation for the nodetool dropquarantinedsstables command Fixes scylladb/scylladb#19061	2025-08-01 11:46:33 +02:00
Nikos Dragazis	2656fca504	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995	2025-08-01 12:11:27 +03:00
Nadav Har'El	2431f92967	alternator, test: add reproducer for issue about immediate LWT timeout This patch adds a reproducer for issue #16261, where it was reported that when Alternator read-modify-write (using LWT) operations to the same partition are sent to different nodes, sometimes the operation fails immediately, with an InternalServerError claiming to be a "timeout", although this happens almost immediately (after a few milliseconds), not after any real timeout. The test uses 3 nodes, and 3 threads which send RMW operations to different items in the same partition, and usually (though not with 100% certainty) it reaches the InternalServerError in around 100 writes by each thread. This InternalServerError looks like: Internal server error: exceptions::mutation_write_timeout_exception (Operation timed out for alternator_alternator_Test_1719157066704.alternator_Test_1719157066704 - received only 1 responses from 2 CL=LOCAL_SERIAL.) The test also prints how much time it took for the request to fail, for example: In incrementing 1,0 on node 1: error after 0.017074108123779297 This is 0.017 seconds - it's not the cas_contention_timeout_in_ms timeout (1 second) or any other timeout. If we enable trace logging, adding to topology_experimental_raft/suite.yaml extra_scylla_cmdline_options: ["--logger-log-level", "paxos=trace"] we get the following TRACE-level message in the log: paxos - CAS[0] accept_proposal: proposal is partially rejected This again shows the problem is "uncertainty" (partial rejection) and not a timeout. Refs #16261 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#19445	2025-08-01 11:58:52 +03:00
Aleksandra Martyniuk	e607ef10cd	api: storage_service: do not log the exception that is passed to user The exceptions that are thrown by the tasks started with API are propagated to users. Hence, there is no need to log it. Remove the logs about exception in user started tasks. Fixes: https://github.com/scylladb/scylladb/issues/16732. Closes scylladb/scylladb#25153	2025-08-01 09:49:51 +03:00
Nadav Har'El	edc15a3cf5	test/cqlpy: slightly strengthen test for system.clients We already have a rather rudimentary test for system.clients listing CQL connections. However, as written the test will pass if system.clients is empty :-) So let's strengthen the test to verify that there must be at least one CQL connection listed in system.clients. Indeed, the test runs the "SELECT FROM system.clients" over one CQL connection, so surely that connection must be present. This patch doesn't strengthen this test in any other way - it still has just one connection, not many, it still doesn't validate the values of most of the columns, and it is still written to assume the Scylla server is running on localhost and not running any other workload in parallel. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:32:19 +03:00
Nadav Har'El	ce0ee27422	generic_server: use utils::scoped_item_list A previous patch introduced utils::scoped_item_list, which maintains a list of items - such as a list of ongoing connections - automatically removing the item from the list when its handle is destroyed. The list can also be iterated "gently" (without risking stalls when the list is long). The implementation of this class was based on very similar code in generic_server.hh / generic_server.cc. So in this patch we change generic_server to use the new scoped_item_list, and drop its own copy of the duplicated logic of maintaining the list and iterating gently over it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:32:14 +03:00
Nadav Har'El	70c94ac9dd	docs/alternator: document the system.clients system table in Alternator Add to docs/alternator/new-apis.md a full description of the `system.clients` support in Alternator that was added in the previous patches. Although arguably all Scylla system tables should work on Alternator and do not need to be individually documented, I believe that this specific table is interesting to document. This is because some of the attributes in this table have non-obvious and Alternator-specific meanings. Moreover, there's even a difference in what each individual item in the table represents (it represents active requests, not entire connections as in CQL). While editing the system tables section of new-apis.md, this patch also slightly improves its formatting. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:05 +03:00
Nadav Har'El	5baa4c40fd	alternator: add test for Alternator clients in system.clients This patch adds a regression test for the feature added in the previous patch, i.e. that the system.clients virtual table also lists ongoing Alternator requests. The new test reads the system.clients system table using an Alternator Scan request, so it should see its own request - at least - in the result. It verifies that it sees Alternator requests (at least one), and that these requests have the expected fields set, and for a couple of fields, we even know which value to expect (the "client_type" field is "alternator", and the "ssl_enabled" field depends on whether the test is checking an http:// or https:// URL; you can try both in test/alternator/run by using or not using the "--https" parameter). The new test fails before the previous patch (because system.clients will not list any Alternator connection), and passes after it. As all tests in test_system_tables.py for Scylla-specific system tables, this test is marked scylla_only and skipped when running on AWS DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:05 +03:00
Nadav Har'El	c14b9c5812	alternator: list active Alternator requests in system.clients Today, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. In this patch we make Alternator active clients also be listed on this virtual table. Unlike CQL where logged in username applies to a complete connection, in the Alternator API, different requests, theoretically signed by different users, can arrive over the same HTTP connection. So instead of listing the currently open connections, we list the currently active requests. This means that when scanning system.clients, you will only see requests which are being handled right now - and not inactive HTTP connections. I think this is good enough (besides being the correct thing to do) - one of the goals of this system.clients is to be able to see what kind of drivers are being used by the user (the "driver_name" field in the system.clients) - on a busy server there will always be some (even many) requests being handled, so we'll always have plenty of requests to see in system.clients. By the way, note that for Alternator requests, what we use for the "driver_name" is the request's User-Agent header. AWS SDKs typically write the driver's name, its version, and often a lot of other information in that header. For example, Boto3 sends a User-Agent looking like: Boto3/1.38.46 md/Botocore#1.38.46 md/awscrt#0.24.2 ua/2.1 os/linux#6.15.4-100.fc41.x86_64 md/arch#x86_64 lang/python#3.13.5 md/pyimpl#CPython m/N,P,b,D,Z cfg/retry-mode#legacy Botocore/1.38.46 Resource A functional test for the new feature - adding Alternator requests to the system.clients table - will be in the next patch. Fixes #24993 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:05 +03:00
Nadav Har'El	20b31987e1	utils: unit test for utils::scoped_item_list The previous patch introduced a new utility class, utils::scoped_item_list. This patch adds a comprehensive unit test for the new class. We test basic usage of scoped_item_list, its size() and empty() methods, how items are removed from the list when their handle goes out of scope, how a handle's move constructor works, how items can be read and written through their handles, and finally that removing an item during a for_each_gently() iteration doesn't break the iteration. One thing I still didn't figure out how to properly test is how removing an item during multiple iterations that run concurrently fixes multiple iterators. I believe the code is correct there (we just have a list of ongoing iterations - instead of just one), but haven't found yet a way to reproduce this situation in a test. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Nadav Har'El	186e6d3ce0	utils: add a scoped_item_list utility class In a later patch, we'll want Alternator to maintain a list of ongoing requests, and be able to list them when the system.clients table is read. This patch introduces a new container, utils::scoped_item_list<T>, that will help Alternator do that: 1. Each request adds an item to the list, and receives a handle; When that handle goes out of scope the item is automatically deleted from the list. 2. Also a method is provided for iterating over the list of items without risking a stall if the list is very long. The new scoped_item_list<T> is heavily based on similar code that is integrated inside generic_server.hh, which is used by CQL to similarly maintain a list of active connections and their properties. However, unfortunately that code is deeply integrated into the generic_server class, and Alternator can't use generic_server because it uses Seastar's HTTP server which isn't based on generic_server. In contrast, the container defined in this patch is stand-alone and does not depend on Alternator in any way. In a later patch in this series we will modify generic_server to use the new scoped_item_list<> instead of having that feature inside it. The next patch is a unit test for the new class we are adding in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Nadav Har'El	33476c7b06	utils: add "fatal" version of utils::on_internal_error() utils::on_internal_error() is a wrapper for Seastar's on_internal_error() which does not require a logger parameter - because it always uses one logger ("on_internal_error"). Not needing a unique logger is especially important when using on_internal_error() in a header file, where we can't define a logger. Seastar also has another similar function, on_fatal_internal_error(), for which we forgot to implement a "utils" version (without a logger parameter). This patch fixes that oversight. In the next patch, we need to use on_fatal_internal_error() in a header file, so the "utils" version will be useful. We will need the fatal version because we will encounter an unexpected situation during server destruction, and if we let the regular on_internal_error() just throw an exception, we'll be left in an undefined state. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-08-01 02:15:04 +03:00
Patryk Jędrzejczak	e53dc7ca86	Merge 'remove unused function and simplify some qp code.' from Gleb Natapov No backport needed since these are cleanups. Closes scylladb/scylladb#25258 * https://github.com/scylladb/scylladb: qp: fold prepare_one function into its only caller qp: co-routinize prepare_one function cql3: drop unused function	2025-07-31 18:19:47 +02:00
Taras Veretilnyk	1d6808aec4	topology_coordinator: Make tablet_load_stats_refresh_interval configurable This commit introduces a config option 'tablet_load_stats_refresh_interval_in_seconds' that allows overriding the default value without using error injection. Fixes scylladb/scylladb#24641 Closes scylladb/scylladb#24746	2025-07-31 14:31:55 +03:00
Gleb Natapov	041011b2ee	qp: fold prepare_one function into its only caller	2025-07-31 14:12:34 +03:00
Gleb Natapov	715f1d994f	qp: co-routinize prepare_one function	2025-07-31 14:11:17 +03:00
Michał Chojnowski	c8682af418	sstables: introduce trie_writer This is the first part of a larger project meant to implement a trie-based index format. (The same or almost the same as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduce it separately from its users to keep PRs small enough for reviewability. This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that: 1. Each node lies entirely within one page (guaranteed). 2. Parents are located in the same page as their children (best-effort). 3. Padding (unused space) is minimized (best-effort). It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages. This implementation mostly follows Cassandra's design of the same thing. There are some differences, though. Notable ones: 1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character. A single long key can be translated to thousands of nodes. We create only one node per key. (Actually we split very long keys into a few nodes, but that's arbitrary and beside the point). For BTI's partition key index this doesn't matter. Since it only stores a minimal unique prefix of each key, and the trie is very balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (And the string-based logic might actually be a small pessimization, since manipulating a 1-byte string might be costlier than manipulating a single byte). But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance. 
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. Total processing cost of a single tiny partition during a `upgradesstables` operation is on the order of 10000 instructions. A small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than character-based logic. (Note that only new characters matter here. If two index entries share a prefix, that prefix is only processed once. And the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But still, they are possible). I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic. 2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than us. For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part. It doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter. There is only one test added in this commit, which attempts to be exhaustive, by testing all possible datasets up to some size. The run time of the test grows exponentially with the parameter size. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic. (I checked the code coverage). 
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).	2025-07-31 12:51:37 +02:00
Calle Wilund	43f7eecf9e	compress: move compress.cc/hh to sstables/compressor Fixes #22106 Moves the shared compress components to sstables, and renames them to match the class type. Adjusts includes, removing redundant/unneeded ones where possible. Closes scylladb/scylladb#25103	2025-07-31 13:10:41 +03:00
Pavel Emelyanov	34608450c5	Merge 'qos: don't populate effective service level cache until auth is migrated to raft' from Piotr Dulikowski Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). Closes scylladb/scylladb#25188 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-07-31 13:05:27 +03:00
Botond Dénes	7e27157664	replica/table: add_sstables_and_update_cache(): remove error log The plural overload of this method logs an error when the sstable add fails. This is unnecessary, the caller is expected to catch and handle exceptions. Furthermore, this unconditional error log results in sporadic test failures, due to the unexpected error in the logs on shutdown. Fixes: #24850 Closes scylladb/scylladb#25235	2025-07-31 12:34:40 +03:00
Jan Łakomy	e69e0cb546	cql/Cql.g: extend the grammar to allow for `ANN OF` queries Extend `orderByClause` so that it can accept the `ORDER BY 'column_name' ANN OF 'vector_literal'` syntax. Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>	2025-07-31 11:11:24 +02:00
Jan Łakomy	d073a4c1fa	cql3/raw: add ANN ordering to the raw statement layer Extend `orderings_type` to include ANN ordering. Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>	2025-07-31 11:11:24 +02:00
Petr Gusev	3500a10197	scylla_cluster.py: add try_get_host_id Tests sometimes fail in ScyllaCluster.add_server on the 'replaced_srv.host_id' line because host_id is not resolved yet. In this commit we introduce functions try_get_host_id and get_host_id that resolve it when needed. Closes scylladb/scylladb#25177	2025-07-31 10:37:06 +02:00
Patryk Jędrzejczak	c41f0e6da9	Merge 'generic server: 2 step shutdown' from Sergey Zolotukhin This PR implements solution proposed in scylladb/scylladb#24481 Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections. The updated shutdown process is as follows: 1. Initial Shutdown Phase * Close the accept gate to block new incoming connections. * Abort all accept() calls. * For all active connections: * Close only the input side of the connection to prevent new requests. * Keep the output side open to allow responses to be sent. 2. Drain Phase * Wait for all in-progress requests to either complete or fail. 3. Final Shutdown Phase * Fully close all connections. Fixes scylladb/scylladb#24481 Closes scylladb/scylladb#24499 * https://github.com/scylladb/scylladb: test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout. generic_server: Two-step connection shutdown. transport: consmetic change, remove extra blanks. transport: Handle sleep aborted exception in sleep_until_timeout_passes generic_server: replace empty destructor with `= default` generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output` generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class. test: Add test for query execution during CQL server shutdown	2025-07-31 10:32:30 +02:00
Nadav Har'El	78c10af960	test/cqlpy: add reproducer for INSERT JSON .. IF NOT EXISTS bug This patch adds an xfailing test reproducing a bug where, when adding an IF NOT EXISTS to an INSERT JSON statement, the IF NOT EXISTS is ignored. This bug has been known for 4 years (issue #8682) and even has a FIXME referring to it in cql3/statements/update_statement.cc, but until now we didn't have a reproducing test. The tests in this patch also show that this bug is specific to INSERT JSON - regular INSERT works correctly - and also that Cassandra works correctly (and passes the test). Refs #8682 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25244	2025-07-30 20:14:50 +03:00
Piotr Smaron	8d5249420b	Update seastar submodule * seastar 60b2e7da...7c32d290 (14): > posix: Replace static_assert with concept > tls: Push iovec with the help of put(vector<temporary_buffer>) > io_queue: Narrow down friendship with reactor > util: drop concepts.hh > reactor: Re-use posix::to_timespec() helper > Fix incorrect defaults for io queue iops/bandwidth > net: functions describing ssl connection > Add label values to the duplicate metrics exception > Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov test: Add unit test for cross-sched-groups wakeups test: Add unit test for fair CPU scheduling test: Add unit test for basic supergrops manipulations test: Add perf test for context switch latency scheduling: Add an internal method to get group's supergroup reactor: Add supergroup get_shares() API reactor: Add supergroup::set_shares() API reactor: Create scheduling groups in supergroups reactor: Supergroups destroying API reactor: Supergroups creating API reactor: Pass parent pointer to task_queue from caller reactor: Wakeup queue group on child activation reactor: Add pure virtual sched_entity::run_tasks() method reactor: Make task_queue_group be sched_entity too reactor: Split task_queue_group::run_some_tasks() reactor: Count and limit supergroup children reactor: Link sched entity to its parent reactor: Switch activate(task_queue) to work on sched_entity reactor: Move set_shares() to sched_entity() reactor: Make account_runtime() work with sched_entity reactor: Make insert_activating_task_queue() work on sched_entity reactor: Make pop_active_task_queue() work on sched_entity reactor: Make insert_active_task_queue() work on sched_entity reactor: Move timings to sched_entity reactor: Move active bit to sched_entity reactor: Move shares to sched_entity reactor: Move vruntime to sched_entity reactor: Introduce sched_entity reactor: Rename _activating_task_queues -> _activating reactor: Remove local atq variable reactor: Rename _active_task_queues 
-> _active reactor: Move account_runtime() to task_queue_group reactor: Move vruntime update from task_queue into _group reactor: Simplify task_queue_group::run_some_tasks() reactor: Move run_some_tasks() into task_queue_group reactor: Move insert_activating_task_queues() into task_queue_group reactor: Move pop_active_task_queue() into task_queue_group reactor: Move insert_active_task_queue() into task_queue_group reactor: Introduce and use task_queue_group::activate(task_queue) reactor: Introduce task_queue_group::active() reactor: Wrap scheduling fields into task_queue_group reactor: Simplify task_queue::activate() reactor: Rename task_queue::activate() -> wakeup() reactor: Make activate() method of class task_queue reactor: Make task_queue::run_tasks() return bool reactor: Simplify task_queue::run_tasks() reactor: Make run_tasks() method of class task_queue > Fix hang in io_queue for big write ioproperties numbers > split random io buffer size in 2 options > reactor: document run_in_background > Merge 'Add io_queue unit test for checking request rates' from Robert Bindar Add unit test for validating computed params in io_queue Move `disk_params` and `disk_config_params` to their own unit Add an overload for `disk_config_params::generate_config` Closes scylladb/scylladb#25254	2025-07-30 16:44:18 +03:00
Patryk Jędrzejczak	5ce16488c9	Merge 'test/cqlpy: two small fixes for "--release" feature' from Nadav Har'El This small series fixes two small bugs in the "--release" feature of test/cqlpy/run and test/alternator/run, which allows a developer to run single-node functional tests against any past release of Scylla. The two patches fix: 1. Allow "run --release" to be used when Scylla has not even been built from source. 2. Fix a mistake in choosing the most recent release when only a ".0" and RC releases are available. This is currently the case for the 2025.2 branch, which is why I discovered the bug now. Fixes #25223 This patch only affects developer's experience if using the test/cqlpy/run script manually (these scripts are not used by CI), so should not be backported. Closes scylladb/scylladb#25227 * https://github.com/scylladb/scylladb: test/cqlpy: fix fetch_scylla.py for .0 releases test/cqlpy: fix "run --release" when Scylla hasn't been built	2025-07-30 15:13:26 +02:00
Petr Gusev	dea41b1764	test_tablets_lwt.py: make tests rf_rack_valid This is a refactoring commit. Remove the rf_rack_valid_keyspaces: False flag because RF-rack validity is going to become mandatory in scylladb/scylladb#23526	2025-07-30 13:48:33 +02:00
Aleksandra Martyniuk	99ff08ae78	streaming: close sink when exception is thrown If an exception is thrown in result_handling_cont in streaming, then the sink does not get closed. This leads to a node crash. Close sink in exception handler. Fixes: https://github.com/scylladb/scylladb/issues/25165. Closes scylladb/scylladb#25238	2025-07-30 14:26:14 +03:00
Petr Gusev	bd82a9d7e5	test_tablets_lwt: add test_lwt_coordinator_shard Check that an LWT coordinator which is not a replica runs on the same shard as a replica.	2025-07-30 13:08:56 +02:00
Andrei Chekun	d0e4045103	test.py: add repeats in pytest The previous way of executing repeats was to launch pytest for each repeat. That was resource consuming, since each time pytest was doing discovery of the tests. Now all repeats are done inside one pytest process.	2025-07-30 12:03:08 +02:00
Andrei Chekun	853bdec3ec	test.py: add directories and filename to the log files Currently, only test function name used for output and log files. For better clarity adding the relative path from the test directory of the file name without extension to these files. Before: test_aggregate_avg.1.log test_aggregate_avg_stdout.1.log After: boost.aggregate_fcts_test.test_aggregate_avg.1.log boost.aggregate_fcts_test.test_aggregate_avg_stdout.3.log	2025-07-30 12:03:08 +02:00
Andrei Chekun	557293995b	test.py: rename log sink file for boost tests Log sink is outputted in XML format not just simple text file. Renaming to have better clarity	2025-07-30 12:03:08 +02:00
Andrei Chekun	cc75197efd	test.py: better error handling in boost facade If test was not executed for some reason, for example not known parameter passed to the test, but boost framework was able to finish correctly, log file will have data but it will be parsed to an empty list. This will raise an exception in pytest execution, rather than produce test output. This change will handle this situation.	2025-07-30 12:03:08 +02:00
Andrei Chekun	4c33ff791b	build: add pytest-timeout to the toolchain Adding this plugin allows using timeout for a test or timeout for the whole session. This can be useful for Unit Test Custom task in the pipeline to avoid running tests in batches, that will mess with the test names later in Jenkins. Closes #25210 [avi: regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes scylladb/scylladb#25243	2025-07-30 12:53:10 +03:00
Gleb Natapov	e496a89f80	cql3: drop unused function	2025-07-30 12:17:23 +03:00
Avi Kivity	5e150eafa4	keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit Avoids thread_local guards on every access.	2025-07-29 23:55:19 +03:00
Avi Kivity	e2316a4a66	keys: make empty creation clustering_key_prefix constexpr Short-circuit make_empty() to construct an empty managed_bytes. Sprinkle constexpr specifiers as needed to make it work.	2025-07-29 23:54:03 +03:00
Avi Kivity	5c6c944797	managed_bytes: make empty managed_bytes constexpr friendly Sprinkle constexpr where needed to make the default constructor, move constructor, and destructor constexpr. Add a test to verify. This is needed to make a thread_local variable containing an empty managed_bytes constinit, reducing thread-local guards.	2025-07-29 23:51:43 +03:00
Avi Kivity	3f6d0d832c	keys: clustering_bounds_comparator: make _empty_prefix a prefix _empty_prefix, as its name suggests, is a prefix, but its type is not. Presumably it works due to implicit conversions. There should not be a clustering_key::make_empty(), but we'll suffer it for now. Fix by making _empty_prefix a prefix.	2025-07-29 23:13:09 +03:00
Petr Gusev	e120ee6d32	storage_proxy.cc: get_cas_shard: fallback to the primary replica shard Currently, get_cas_shard uses shard_for_reads to decide which shard to use for LWT execution—both on replicas and the coordinator. If the coordinator is not a replica, shard_for_reads returns a default shard (shard 0). There are at least two problems with this: * shard 0 can become overloaded, because all LWT coordinators-but-not-replicas are served on it. * mismatch with replicas: the default shard doesn't match what shard_for_reads returns on replicas. This hinders the "same shard for client and server" RPC level optimization. In this commit we change get_cas_shard to use a primary replica shard if the current node is not a replica. This guarantees that all LWT coordinators for the same tablet will be served on the same shard. This is important for LWT coordinator locks (paxos::paxos_state::get_cas_lock). Also, if all tablet replicas on different nodes live on the same shard, RPC optimization will make sure that no additional smp::submit_to will be needed on the server side. Fixes scylladb/scylladb#20497	2025-07-29 17:07:04 +02:00
Botond Dénes	2985c343ed	Merge 'repair: Avoid too many fragments in a single repair_row_on_wire' from Asias He When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808 Closes scylladb/scylladb#25002 * github.com:scylladb/scylladb: repair: Avoid too many fragments in a single repair_row_on_wire repair: Change partition_key_and_mutation_fragments to use chunked_vector utils: Allow chunked_vector::erase to work with non-default-constructible type	2025-07-29 17:45:57 +03:00
Patryk Jędrzejczak	8e43856ca7	Merge 'Pass more elaborated "reasons" to stop_ongoing_compactions()' from Pavel Emelyanov When running compactions are aborted by the aforementioned helper, in logs there appear a line like "Compaction for ks/cf was stopped due to: user-triggered operation". This message could've been better, since it may indicate several distinct reasons described with the same "user-triggered operation". With this PR the message will help telling "truncate", "cleanup", "rewrite" and "split" from each other. Closes scylladb/scylladb#25136 * https://github.com/scylladb/scylladb: compaction: Pass "reason" to perform_task_on_all_files() compaction: Pass "reason" to run_with_compaction_disabled() compaction: Pass "reason" to stop_and_disable_compaction()	2025-07-29 16:06:17 +02:00
Pavel Emelyanov	286fad4da6	api: Simplify table_info::name extraction with std::views::transform Instead of using lambda, pass pointer to struct member. The result is the same, but the code is nicer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25123	2025-07-29 15:56:58 +02:00
Sergey Zolotukhin	4f63e1df58	test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout. In debug mode, queries may sometimes take longer than the default 30 seconds. To address this, the timeout value `request_timeout_on_shutdown_in_seconds` during tests is aligned with other request timeouts. Change request timeout for tests from 180s to 90s since we must keep the request timeout during shutdown significantly lower than the graceful shutdown timeout(2m), or else a request timeout would cause a graceful shutdown timeout and fail a test.	2025-07-29 15:37:47 +02:00
Nadav Har'El	22f845b128	docs/alternator: mention missing ShardFilter support Add in docs/alternator/compatibility.md a mention of the ShardFilter option which we don't support in Alternator Streams. This option was only introduced to DynamoDB a week ago, so it's not surprising we don't yet support it :-) Refs #25160 Closes scylladb/scylladb#25161	2025-07-29 14:37:24 +03:00
Andrei Chekun	a6a3d119e8	docs: update documentation with new way of running C++ tests Documentation had outdated information how to run C++ test. Additionally, some information added about gathered test metrics. Closes scylladb/scylladb#25180	2025-07-29 14:36:19 +03:00
Dawid Mędrek	408b45fa7e	db/commitlog: Extend error messages for corrupted data We're providing additional information in error messages when throwing an exception related to data corruption: when a segment is truncated and when its content is invalid. That might prove helpful when debugging. Closes scylladb/scylladb#25190	2025-07-29 14:35:14 +03:00
Anna Stuchlik	b67bb641bc	doc: add OS support for ScyllaDB 2025.3 This commit adds the information about support for platforms in ScyllaDB version 2025.3. Fixes https://github.com/scylladb/scylladb/issues/24698 Closes scylladb/scylladb#25220	2025-07-29 14:33:12 +03:00
Anna Stuchlik	8365219d40	doc: add the upgrade guide from 2025.2 to 2025.3 This PR adds the upgrade guide from version 2025.2 to 2025.3. Also, it removes the upgrade guide existing for the previous version that is irrelevant in 2025.2 (upgrade from 2025.1 to 2025.2). Note that the new guide does not include the "Enable Consistent Topology Updates" page and note, as users upgrading to 2025.3 have consistent topology updates already enabled. Fixes https://github.com/scylladb/scylladb/issues/24696 Closes scylladb/scylladb#25219	2025-07-29 14:32:31 +03:00
Avi Kivity	11ee58090c	commitlog: replace std::enable_if with a constraint std::enable_if is obsolete and was replaced with concepts and constraint. Replace the std::is_fundamental_v enable_if constraint with std::integral. The latter is more accurate - std::ntoh() is not defined for floats, for example. In any case, we only read integrals in commitlog. Closes scylladb/scylladb#25226	2025-07-29 12:51:24 +02:00
Michał Chojnowski	6d27065f99	cql3/result_set: set GLOBAL_TABLES_SPEC in `metadata` if appropriate Unless the client uses the SKIP_METADATA flag, Scylla attaches some metadata to query results returned to the CQL client. In particular, it attaches the spec (keyspace name, table name, name, type) of the returned columns. By default, the keyspace name and table name are present in each column spec. However, since they are almost always the same for every column (I can't think of any case when they aren't the same; it would make sense if Cassandra supported joins, but it doesn't) that's a waste. So, as an optimization, the CQL protocol has the GLOBAL_TABLES_SPEC flag. The flag can be set if all columns belong to the same table, and if it is set, then the keyspace and table name are only written in the first column spec, and skipped in other column specs. Scylla sets this flag, if appropriate, in responses to PREPARE requests. But it never sets the flag in responses to queries. But it could. And this patch causes it to do that. Fixes #17788 Closes scylladb/scylladb#25205	2025-07-29 12:40:12 +03:00
Piotr Dulikowski	3a082d314c	test: sl: verify that legacy auth is not queried in sl to raft upgrade Adjust `test_service_levels_upgrade`: right before upgrade to topology on raft, enable an error injection which triggers when the standard role manager is about to query the legacy auth tables in the system_auth keyspace. The preceding commit which fixes scylladb/scylladb#24963 makes sure that the legacy tables are not queried during upgrade to topology on raft, so the error injection does not trigger and does not cause a problem; without that commit, the test fails.	2025-07-29 11:39:17 +02:00
Piotr Dulikowski	2bb800c004	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963	2025-07-29 11:37:37 +02:00
Petr Gusev	801bf42ea2	sharder: add try_get_shard_for_reads method Currently, we use storage_proxy/get_cas_shard -> sharder.shard_for_reads to decide which shard to use for LWT code execution on both replicas and the coordinator. If the coordinator is not a replica, shard_for_reads returns 0 — the 'default' shard. This behavior has at least two problems: * Shard 0 may become overloaded, because all LWT coordinators that are not replicas will be served on it. * The zero shard does not match shard_for_reads on replicas, which hinders the "same shard for client and server" RPC-level optimization. To fix this, we need to know whether the current node hosts a replica for the tablet corresponding to the given token. Currently, there is no API we could use for this. For historical reasons, sharder::shard_for_reads returns 0 when the node does not host the shard, which leads to ambiguity. This commit introduces try_get_shard_for_reads, which returns a disengaged std::optional when the tablet is not present on the local node. We leave shard_for_reads method in the base sharder class, it calls try_get_shard_for_reads and returns zero by default. We need to rename tablet_sharder private methods shard_for_reads and shard_for_writes so that they don't conflict with the sharder::shard_for_reads.	2025-07-29 11:35:54 +02:00
Nadav Har'El	f6a3e6fbf0	sstables: don't depend on fmt 11.1 to build A recent commit `a0c29055e5` added some trace printouts which print an std::reference_wrapper<>. Apparently a formatter for this type was only added to fmt in version 11.1.0, and it doesn't exist on earlier versions, such as fmt 11.0.2 on Fedora 41. Let's avoid requiring shiny-new versions of fmt. The workaround is easy: just unwrap the reference_wrapper - print pr.get() instead of just pr, and Scylla returns to building correctly on Fedora 41. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25228	2025-07-29 11:32:06 +02:00
Patryk Jędrzejczak	3299ffba51	Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`). However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This PR reworks the shutdown logic: * Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called. * Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup. The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods. Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`. Refs #25115 Fixes #24625 Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR. Closes scylladb/scylladb#25151 * https://github.com/scylladb/scylladb: raft_group0: split shutdown into abort_and_drain and destroy Revert "main.cc: fix group0 shutdown order"	2025-07-29 10:39:00 +02:00
Asias He	e28c75aa79	repair: Avoid too many fragments in a single repair_row_on_wire When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message. This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression. This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream. Tests are added to make sure the message split works. Fixes #24808	2025-07-29 13:43:53 +08:00
Asias He	266a518e4c	repair: Change partition_key_and_mutation_fragments to use chunked_vector With the change in "repair: Avoid too many fragments in a single repair_row_on_wire", the std::list<frozen_mutation_fragment> _mfs; in partition_key_and_mutation_fragments will not contain large number of fragments any more. Switch to use chunked_vector.	2025-07-29 13:43:17 +08:00
Asias He	4a4fbae8f7	utils: Allow chunked_vector::erase to work with non-default-constructible type This is needed for chunked_vector<frozen_mutation_fragment> in repair.	2025-07-29 13:43:17 +08:00
Avi Kivity	d3cdb88fe7	tools: toolchain: dbuild: increase depth of nested podman configuration coverage The initial support for nested containers (`2d2a2ef277`) worked on my machine (tm) and even laptop, but does not work on fresh installs. This is likely due to changes in where persistent configuration is stored on the host between various podman versions; even though my podman is fully updated, it uses configuration created long ago. Make nested containers work on fresh installs by also configuring /etc/containers/storage.conf. The important piece is to set graphroot to the same location as the host. Verified both on my machine and on a fresh install. Closes scylladb/scylladb#25156	2025-07-29 08:23:41 +03:00
Botond Dénes	f3ed27bd9e	Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov Nowadays the way to configure an internal service is 1. service declares its config struct 2. caller (main/test/tool) fills the respective config with values it wants 3. the service is started with the config passed by value The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config. For the reference: similar changes with other services: #23705 , #20174 , #19166 Closes scylladb/scylladb#25118 * github.com:scylladb/scylladb: gms,init: Move get_disabled_features_from_db_config() from gms code: Update callers generating feature service config gms: Make feature_config a simple struct gms: Split feature_config_from_db_config() into two	2025-07-29 08:17:49 +03:00
Anna Stuchlik	18b4d4a77c	doc: add tablets support information to the Drivers table This commit: - Extends the Drivers support table with information on which driver supports tablets and since which version. - Adds the driver support policy to the Drivers page. - Reorganizes the Drivers page to accommodate the updates. In addition: - The CPP-over-Rust driver is added to the table. - The information about Serverless (which we don't support) is removed and replaced with tablets to correctly describe the contents of the table. Fixes https://github.com/scylladb/scylladb/issues/19471 Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69 Closes scylladb/scylladb#24635	2025-07-29 08:11:42 +03:00
Avi Kivity	f7324a44a2	compaction: demote normal compaction start/end log messages to debug level Compaction is routine and the log messages pollute the log files, hiding important information. All the data is available via `nodetool compactionhistory`. Reduce noise by demoting those log messages to debug level. One test is adjusted to use debug level for compaction, since it listens for those messages. Closes scylladb/scylladb#24949	2025-07-29 08:02:22 +03:00
Nadav Har'El	e43828c10b	test/cqlpy: fix fetch_scylla.py for .0 releases The test/cqlpy/fetch_scylla.py script is used by test/cqlpy/run and test/alternator/run to implement their "--release" option - which allows you to run current tests against any official release of Scylla downloaded from Scylla's S3 bucket. When you ask to get release "2025.1", the idea is to fetch the latest release available in the 2025.1 stream - currently it is 2025.1.5. fetch_scylla.py does this by listing the available 2025.1 releases, sorting them and fetching the last one. We had a bug in the sort order - version 0 was sorted before version 0-rc1, which is incorrect (the version 2025.2.0 came after 2025.2.0~rc1). For most releases this didn't cause any problem - 0~rc1 was sorted after 0, but 5 (for example) came after both, so 2025.1.5 got downloaded. But when a release has only an rc and a .0 release, we incorrectly used the rc instead of the .0. This patch fixes the sort order by using the "/" character, which sorts before "0", in rc version strings when sorting the release numbers. Before this patch, we had this problem in "--release 2025.2" because currently 2025.2 only has RC releases (rc0 and rc1) and a .0 release, and we wrongly downloaded the rc1. After this patch, the .0 is chosen as expected: $ test/cqlpy/run --release 2025.2 Chosen download for ScyllaDB 2025.2: 2025.2.0 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-28 22:02:15 +03:00
Nadav Har'El	72358ee9f4	test/cqlpy: fix "run --release" when Scylla hasn't been built The "--release" option of test/cqlpy/run can be used to run current cqlpy tests against any official release of Scylla, which is automatically downloaded from Scylla's S3 bucket. You should be able to run tests like that even without having compiled Scylla from source. But we had a bug, where test/cqlpy/run looked for the built Scylla executable before parsing the "--release" option, and this bug is fixed in this patch. The Alternator version of the run script, test/alternator/run, doesn't need to be fixed because it already did things in the right order. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-28 21:42:02 +03:00
Taras Veretilnyk	3bc9ee10d1	nodetool: add command for dropping quarantine sstables - Add dropquarantinedsstables command to remove quarantined SSTables - Support both flag-based (--keyspace, --table) and positional arguments - Allow targeting all keyspaces, specific keyspace, or keyspace with specified tables Fixes scylladb/scylladb#19061	2025-07-28 16:55:17 +02:00
Taras Veretilnyk	fa98239ed8	rest_api: add endpoint which drops all quarantined sstables Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061	2025-07-28 16:55:17 +02:00
Dawid Mędrek	b41151ff1a	test: Enable RF-rack-valid keyspaces in all Python suites We're enabling the configuration option `rf_rack_valid_keyspaces` in all Python test suites. All relevant tests have been adjusted to work with it enabled. That encompasses the following suites: * alternator, * broadcast_tables, * cluster (already enabled in scylladb/scylladb@ee96f8dcfc), * cql, * cqlpy (already enabled in scylladb/scylladb@be0877ce69), * nodetool, * rest_api. Two remaining suites that use tests written in Python, redis and scylla_gdb, are not affected, at least not directly. The redis suite requires creating an instance of Scylla manually, and the tests don't do anything that could violate the restriction. The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but even then it reuses the `run` file from the cqlpy suite. Fixes scylladb/scylladb#25126 Closes scylladb/scylladb#24617	2025-07-28 16:32:59 +02:00
Gleb Natapov	198cfc6fe7	migration manager: do not use group0 on non zero shard Commit `ddc3b6dcf5` added a check of group0 state in get_schema_for_write(), but group0 client can only be used on shard 0, and get_schema_for_write() can be called on any shard, so we cannot use _group0_client there directly. Move the assert to where we already use another group0 function and where it is guaranteed to run on shard 0. Closes scylladb/scylladb#25204	2025-07-28 14:10:01 +02:00
Michał Chojnowski	10e742df2d	storage_service: in publish_new_sstable_dict, use _group0_as instead of the main abort source We probably want to abort this call also when the node is being DRAINED, not just when it's being shut down. So use a more appropriate abort_source.	2025-07-28 12:42:37 +02:00
Michał Chojnowski	6ba2781dff	storage_service: hold group0 gate in `publish_new_sstable_dict` When I was writing this code, I assumed that `group0_client` is a friendly API that would take care of keeping the group0 server alive until the client call returns. But after examining the code of `group0_client`, I think I was wrong, and that group0 must be kept alive by external means. So this patch attempts to keep group0 alive until the `publish_new_sstable_dict` call returns.	2025-07-28 12:42:37 +02:00
Nadav Har'El	b4fc3578fc	Merge 'LWT: enable for tablet-based tables' from Petr Gusev This PR enables LWT (Lightweight Transactions) support for tablet-based tables by leveraging colocated tables. Currently, storing Paxos state in system tables causes two major issues: * Loss of Paxos state during tablet migration or base table rebuilds * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state. * This breaks LWT correctness in certain scenarios. * Failing test cases demonstrating this: * test_lwt_state_is_preserved_on_tablet_migration * test_lwt_state_is_preserved_on_rebuild * Shard misalignment and performance overhead * Tablets may be placed on arbitrary shards by the tablet balancer. * Accessing Paxos state in system tables could require a shard jump, degrading performance. We move Paxos state into a dedicated Paxos table, colocated with the base table: * Each base table gets its own Paxos state table. * This table is lazily created on the first LWT operation. * Its tablets are colocated with those of the base table, ensuring: * Co-migration during tablet movement * Co-rebuilding with the base table * Shard alignment for local access to Paxos state Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2]. This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]: * Tablet migration vs LWT correctness * Paxos table sharding Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs. 
References [1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu) [2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0) [3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886) [4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251) Backport: not needed because this is a new feature. Closes scylladb/scylladb#24819 * github.com:scylladb/scylladb: create_keyspace: fix warning for tablets docs: fix lwt.rst docs: fix tablets.rst alternator: enable LWT random_failures: enable execute_lwt_transaction test_tablets_lwt: add test_paxos_state_table_permissions test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft test_tablets_lwt: test timeout creating paxos state table test_tablets_lwt: add test_lwt_concurrent_base_table_recreation test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild test_tablets_lwt: migrate test_lwt_support_with_tablets test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration test_tablets_lwt: add simple test for LWT check_internal_table_permissions: handle Paxos state tables client_state: extract check_internal_table_permissions paxos_store: handle base table removal database: get_base_table_for_tablet_colocation: handle paxos state table paxos_state: use node_local_only mode to access paxos state query_options: add node_local_only mode storage_proxy: handle node_local_only in query storage_proxy: handle node_local_only in mutate storage_proxy: introduce node_local_only flag abstract_replication_strategy: remove unused using storage_proxy: add coordinator_mutate_options storage_proxy: rename create_write_response_handler -> make_write_response_handler storage_proxy: simplify mutate_prepare paxos_state: lazily create paxos state table 
migration_manager: add timeout to start_group0_operation and announce paxos_store: use non-internal queries qp: make make_internal_options public paxos_store: conditional cf_id filter paxos_store: coroutinize feature_service: add LWT_WITH_TABLETS feature paxos_state: inline system_keyspace functions into paxos_store paxos_state: extract state access functions into paxos_store	2025-07-28 13:19:23 +03:00
Taras Veretilnyk	6b6622e07a	docs: fix typo in command name enbleautocompaction -> enableautocompaction Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'. Fixes scylladb/scylladb#25172 Closes scylladb/scylladb#25175	2025-07-28 12:49:26 +03:00
Tomasz Grabiec	55116ee660	topology_coordinator: Trigger load stats refresh after replace Otherwise, tablet rebuild will be delayed for up to 60s, as the tablet scheduler needs load stats for the new node (replacing) to make decisions. Fixes #25163 Closes scylladb/scylladb#25181	2025-07-28 11:07:17 +02:00
Sergey Zolotukhin	ea311be12b	generic_server: Two-step connection shutdown. When shutting down in `generic_server`, connections are now closed in two steps. First, only the RX (receive) side is shut down. Then, after all ongoing requests have completed or a timeout has occurred, the connections are fully closed. Fixes scylladb/scylladb#24481	2025-07-28 10:08:06 +02:00
Sergey Zolotukhin	7334bf36a4	transport: cosmetic change, remove extra blanks.	2025-07-28 10:08:06 +02:00
Sergey Zolotukhin	061089389c	transport: Handle sleep aborted exception in sleep_until_timeout_passes In PR #23156, a new function `sleep_until_timeout_passes` was introduced to wait until a read request times out or completes. However, the function did not handle cases where the sleep is aborted via _abort_source, which could result in WARN messages like "Exceptional future is ignored" during shutdown. This change adds proper handling for that exception, eliminating the warning.	2025-07-28 10:08:05 +02:00
Sergey Zolotukhin	27b3d5b415	generic_server: replace empty destructor with `= default` This change improves code readability by explicitly marking the destructor as defaulted.	2025-07-28 10:08:05 +02:00
Sergey Zolotukhin	3610cf0bfd	generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output` This change improves logging and modifies the behavior to attempt closing the output side of a connection even if an error occurs while closing the input side.	2025-07-28 10:08:05 +02:00
Sergey Zolotukhin	3848d10a8d	generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class. The functions are just wrappers for _fd.shutdown_input() and _fd.shutdown_output(), with added error reporting. Needed by later changes.	2025-07-28 10:08:05 +02:00
Sergey Zolotukhin	122e940872	test: Add test for query execution during CQL server shutdown This test simulates a scenario where a query is being executed while the query coordinator begins shutting down the CQL server and client connections. The shutdown process should wait until the query execution is either completed or timed out. Test for scylladb/scylladb#24481	2025-07-28 10:08:05 +02:00
Robert Bindar	d921a565de	Add open-coredump script dependencies to install-dependencies.sh Whilst the coredump script checks for prerequisites, the user experience is not ideal because you either have to go in the script and get the list of deps and install them or wait for the script to complain about lacking dependencies one by one. This commit completes the list of dependencies in the install script (some of them were already there for Fedora), so you already have them installed by the time you get to run the coredump script. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> [avi: - remove trailing whitespace - regenerate frozen toolchain Optimized clang binaries generated and stored in https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes #22369 Closes scylladb/scylladb#25203	2025-07-28 06:45:01 +03:00
Avi Kivity	1930f3e67f	Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space. However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position. For example, if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first index entry after key `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.) Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges. Note: the patch is more complicated than it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays?) Simplification ideas are welcome. Preparation for new functionality, no backporting needed. 
Closes scylladb/scylladb#25093 * github.com:scylladb/scylladb: sstables/index_reader: weaken some exactness guarantees in abstract_index_reader test/boost: add a test for inexact index lookups sstables/mx/reader: allow passing a custom index reader to the constructor sstables/index_reader: remove advance_to sstables/mx/reader: handle inexact lookups in `advance_context()` sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` sstables/index_reader: make the return value of `get_partition_key` optional sstables/mx/reader: handle "backward jumps" in forward_to sstables/mx/reader: filter out partitions outside the queried range sstables/mx/reader: update _pr after `fast_forward_to`	2025-07-27 19:39:36 +03:00
Avi Kivity	8180cbcf48	Merge 'tablets: prevent accidental copy of tablets_map' from Benny Halevy As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. * minor optimization, no backport needed Closes scylladb/scylladb#24978 * github.com:scylladb/scylladb: tablets: prevent accidental copy of tablets_map locator: tablets: get rid of synchronous mutate_tablet_map	2025-07-27 16:48:27 +03:00
Lakshmi Narayanan Sreethar	0c5fa8e154	locator/token_metadata.cc: use chunked_vector to store _sorted_tokens The `token_metadata_impl` stores the sorted tokens in an `std::vector`. With a large number of nodes, the size of this vector can grow quickly, and updating it might lead to oversized allocations. This commit changes `_sorted_tokens` to a `chunked_vector` to avoid such issues. It also updates all related code to use `chunked_vector` instead of `std::vector`. Fixes #24876 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25027	2025-07-27 11:29:22 +03:00
Tomasz Grabiec	a1d7722c6d	Merge 'api: repair_async: refuse repairing tablet keyspaces' from Aleksandra Martyniuk A tablet repair started with /storage_service/repair_async/ API bypasses tablet repair scheduler and repairs only the tablets that are owned by the requested node. Due to that, to safely repair the whole keyspace, we need to first disable tablet migrations and then start repair on all nodes. With the new API - /storage_service/tablets/repair - tailored to tablet repair requirements, we do not need additional preparation before repair. We may request it on one node in a cluster only and, thanks to tablet repair scheduler, a whole keyspace will be safely repaired. Both nodetool and Scylla Manager have already started using the new API to repair tablets. Refuse repairing tablet keyspaces with /storage_service/repair_async - 403 Forbidden is returned. repair_async should still be used to repair vnode keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/23008. Breaking change; no backport. Closes scylladb/scylladb#24678 * github.com:scylladb/scylladb: repair: remove unused code api: repair_async: forbid repairing tablet keyspaces	2025-07-27 09:25:42 +02:00
Piotr Dulikowski	44de563d38	Merge 'db/hints: Improve logging' from Dawid Mędrek We improve logging in critical functions in hinted handoff to capture more information about the behavior of the module. That should help us in debugging sessions. The logs should only be printed during more important events and so they should not clog the log files. Backport: not necessary. Closes scylladb/scylladb#25031 * github.com:scylladb/scylladb: db/hints/manager.cc: Add logs for changing host filter db/hints: Increase log level in critical functions	2025-07-27 09:25:42 +02:00
Michael Litvak	3ff388cd94	storage service: drain view builder before group0 The view builder uses group0 operations to coordinate view building, so we should drain the view builder before stopping group0. Fixes scylladb/scylladb#25096 Closes scylladb/scylladb#25101	2025-07-27 09:25:42 +02:00
Pavel Emelyanov	403a72918d	sstables/types.hh: Remove duplicate version.hh inclusion The latter header is included two times, one is enough Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25109	2025-07-27 09:25:42 +02:00
Pavel Emelyanov	1b9eb4cb9f	init.hh: Remove unused forward declarations The init.hh contains some bits that only main.cc needs. Some of its forward declarations are needed by neither the header itself, nor the main.cc that includes it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25110	2025-07-27 09:25:42 +02:00
Petr Gusev	8b8b7adbe5	raft_group0: split shutdown into abort_and_drain and destroy Previously, raft_group0::abort() was called in storage_service::do_drain (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because raft::server depends on storage (via raft_sys_table_storage and group0_state_machine). However, this caused issues: services like sstable_dict_autotrainer and auth::service, which use group0_client but are not stopped by storage_service, could trigger use-after-free if raft_group0 was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This commit reworks the shutdown logic: * Introduces abort_and_drain(), which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see raft::stopped_error if they try to access group0 after abort_and_drain(). * Final destruction happens in a separate method destroy(), called later from main.cc. The raft_server_for_group::aborted is changed to a shared_future -- abort_server now returns a future so that we can wait for it in abort_and_drain(), it should return the future from the previous abort_server call, which can happen in the on_background_error callback. Node startup can fail before reaching storage_service, in which case ss.drain_on_shutdown() and abort_and_drain() are never called. To ensure proper cleanup, abort_and_drain() is called from main.cc before destroy(). Clients of raft_group_registry are expected to call destroy_server() for the servers they own. Currently, the only such client is raft_group0, which satisfies this requirement. As a result, raft_group_registry::stop_servers() is no longer needed. Instead, raft_group_registry::stop() now verifies that all servers have been properly destroyed. If any remain, it calls on_internal_error(). The call to drain_on_shutdown() in cql_test_env.cc appears redundant. 
The only source of raft::server instances in raft_group_registry is group0_service, and if group0_service.start() succeeds, both abort_and_drain() and destroy() are guaranteed to be called during shutdown.	2025-07-25 17:16:14 +02:00
Michał Chojnowski	b1da5f2d0f	sstables/index_reader: weaken some exactness guarantees in abstract_index_reader After making the sstable reader more permissive, we can weaken the abstract_index_reader interface.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	be1f54c6d2	test/boost: add a test for inexact index lookups	2025-07-25 11:00:18 +02:00
Michał Chojnowski	810eb93ff0	sstables/mx/reader: allow passing a custom index reader to the constructor For tests. Will be used for testing how the data reader reacts to various combinations of inexact index lookup results.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	fe8ee34024	sstables/index_reader: remove advance_to `advance_to` is unused now, so remove it.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	03bf6347e2	sstables/mx/reader: handle inexact lookups in `advance_context()` `advance_context()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_context()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI. This makes `advance_to` unused, so we remove it.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	11792850dd	sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` `advance_to_next_partition()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_to_next_partition()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	141895f9eb	sstables/index_reader: make the return value of `get_partition_key` optional BTI indexes only store encoded prefixes of partition keys, not the whole keys. They can't reliably implement `get_partition_key`. The index reader interface must be weakened and callers must be adapted.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	a0c29055e5	sstables/mx/reader: handle "backward jumps" in forward_to A bunch of code assumes that the Data.db stream can only go forward. But with BTI indexes, if we perform an advance_to, the index can point to a position which the data reader has already passed, since the index is inexact. The logic of the data reader ensures that it has stopped within the last partition range, or just immediately after it, after reading the next partition key and noticing that it doesn't belong to the range. But forward_to can only be used with increasing ranges. The start of the next range must be greater than or equal to the end of the previous range. This means that the exact start of the next partition range must be no earlier than: 1. Before the partition key just read by the data reader, if the data reader is positioned immediately after a partition key. 2. The start of the first partition after the current data reader position, if the data reader isn't positioned immediately after a partition key. So, if the index returns a position smaller than the current data reader position, then: 1. If the reader is immediately after a partition key, we have to reuse this partition key (since we can't go back in the stream to read it again), and keep reading from the current position. 2. Otherwise we can safely walk the index to the first partition that lies no earlier than the current position.	2025-07-25 10:49:58 +02:00
Michał Chojnowski	218b2dffff	sstables/mx/reader: filter out partitions outside the queried range The current index format is exact: it always returns the position of the first partition in the queried partition range. But we are about to add an index format where that doesn't have to be the case. In BTI indexes, the lookup can be off by one partition sometimes. This patch prepares the reader for that, by skipping the partitions which were read by the data reader but don't belong to the queried range. Note: as of this patch, only the "normal path" is ever used. We add tests exercising these code paths later. Also note that, as of this patch, actually stepping outside the queried range would cause the reader to end up in a state where the underlying parser is positioned right after the partition key immediately following the queried range. If the reader was forwarded to that key in this state, it would trip an assert, because the parser can't handle backward jumps. We will add logic to handle this case in the next patch.	2025-07-25 10:49:57 +02:00
Michał Chojnowski	2b81fdf09b	sstables/mx/reader: update _pr after `fast_forward_to` In later patches, we will prepare the reader for inexact index implementations (ones which can return a Data file range that includes some partitions before or after the queried range). For that, we will need to filter out the partitions outside of the range, and for that we need to remember the range. This is the goal of this patch. Note that we are storing a reference to an argument of `fast_forward_to`. This is okay, because the contract of `mutation_reader` specifies that the caller must keep `pr` alive until the next `fast_forward_to` or until the reader is destroyed.	2025-07-25 10:49:57 +02:00
Aleksandra Martyniuk	a7ee2bbbd8	tasks: do not use binary progress for task manager tasks Currently, progress of a parent task depends on expected_total_workload, expected_children_number, and children progresses. Basically, if total workload is known or all children have already been created, progresses of children are summed up. Otherwise binary progress is returned. As a result, two tasks of the same type may return progress in different units. If they are children of the same task and this parent gathers the progress - it becomes meaningless. Drop expected_children_number as we can't assume that children are able to show their progresses. Modify get_progress method - progress is calculated based on children progresses. If expected_total_workload isn't specified, the total progress of a task may grow. If expected_total_workload isn't specified and no children are created, empty progress (0/0) is returned. Fixes: https://github.com/scylladb/scylladb/issues/24650. Closes scylladb/scylladb#25113	2025-07-25 10:45:32 +03:00
Ran Regev	7c68ee06bf	cleanup: remove partition_slice_builder from include Refs: #22099 (issue) Refs: #25079 (pr) remove include for partition_slice_builder that is not used. makes it clear that group0_state_machine.cc does not depend on partition_slice_builder Closes scylladb/scylladb#25125	2025-07-25 10:45:32 +03:00
Ran Regev	db4f301f0c	scylla.yaml: add recommended value for stream_io_throughput_mb_per_sec Fixes: #24758 Updated scylla.yaml and the help for scylla --help Closes scylladb/scylladb#24793	2025-07-25 10:45:32 +03:00
Ferenc Szili	7ce96345bf	test: remove test_tombstone_gc_disabled_on_pending_replica The test test_tombstone_gc_disabled_on_pending_replica was added when we fixed (#20788) the potential problem with data resurrection during file based streaming. The issue was occurring only in Enterprise, but we added the fix in OSS to limit code divergence. This test was added together with the fix in OSS with the idea to guard this change in OSS. The real reproducer and test for this fix was added later, after the fix was ported into Enterprise. It is in: test/cluster/test_resurrection.py Since Enterprise has been merged into OSS, there is no more need to keep the test test_tombstone_gc_disabled_on_pending_replica. Also, it is flaky with very low probability of failure, making it difficult to investigate the cause of failure. Fixes: #22182 Closes scylladb/scylladb#25134	2025-07-25 10:45:32 +03:00
Botond Dénes	837424f7bb	Merge 'Add Azure Key Provider for Encryption at Rest' from Nikos Dragazis This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping"). In more detail, this patch series consists of: * Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI). * The Azure host - the Key Vault endpoint bridge. * The Azure Key Provider - the interface for the Azure host. * Unit tests using real Azure resources (credentials and Vault keys). * Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens). This is part of the overall effort to support Azure deployments. Testing done: * Unit tests. * Manual test on an Azure VM with a Managed Identity. * Manual test with credentials from Azure CLI. * Manual test of `--azure-hosts` cmdline option. * Manual test of log filtering. Remaining items: - [x] Create necessary Azure resources for CI. - [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201). Closes https://github.com/scylladb/scylla-enterprise/issues/1077. New feature. No backport is needed. 
Closes scylladb/scylladb#23920 * github.com:scylladb/scylladb: docs: Document the Azure Key Provider test: Add tests for Azure Key Provider pylib: Add mock server for Azure Key Vault encryption: Define and enable Azure Key Provider encryption: azure: Delegate hosts to shard 0 encryption: Add Azure host cache encryption: Add config options for Azure hosts encryption: azure: Add override options encryption: azure: Add retries for transient errors encryption: azure: Implement init() encryption: azure: Implement get_key_by_id() encryption: azure: Add id-based key cache encryption: azure: Implement get_or_create_key() encryption: azure: Add credentials in Azure host encryption: azure: Add attribute-based key cache encryption: azure: Add skeleton for Azure host encryption: Templatize get_{kmip,kms,gcp}_host() encryption: gcp: Fix typo in docstring utils: azure: Get access token with default credentials utils: azure: Get access token from Azure CLI utils: azure: Get access token from IMDS utils: azure: Get access token with SP certificate utils: azure: Get access token with SP secret utils: rest: Add interface for request/response redaction logic utils: azure: Declare all Azure credential types utils: azure: Define interface for Azure credentials utils: Introduce base64url_{encode,decode}	2025-07-25 10:45:32 +03:00
Ernest Zaslavsky	d2c5765a6b	treewide: Move keys related files to a new keys directory As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system. Moved files: - clustering_bounds_comparator.hh - keys.cc - keys.hh - clustering_interval_set.hh - clustering_key_filter.hh - clustering_ranges_walker.hh - compound_compat.hh - compound.hh - full_position.hh Fixes: #22102 Fixes: #22103 Fixes: #22105 Closes scylladb/scylladb#25082	2025-07-25 10:45:32 +03:00
Calle Wilund	a86e8d73f2	encryption_at_rest_test: ensure proxy connection flushing Refs #24551 Drops background flush for proxy output stream (because test), and also ensures we do explicit flush + close on exception in write loop. Ensures we don't hide actual exceptions with asserts. Closes scylladb/scylladb#25146	2025-07-25 10:45:32 +03:00
Petr Gusev	aae5260147	create_keyspace: fix warning for tablets Remove LWT from the list of unsupported features.	2025-07-24 20:04:43 +02:00
Petr Gusev	1f5d9ace93	docs: fix lwt.rst Add a new section about Paxos state tables. Update all references to system.paxos in the text to refer to this section.	2025-07-24 20:04:43 +02:00
Petr Gusev	69017fb52a	docs: fix tablets.rst LWT and Alternator are now supported with tablets.	2025-07-24 20:04:43 +02:00
Petr Gusev	abab025d4f	alternator: enable LWT	2025-07-24 20:04:43 +02:00
Petr Gusev	e4fba1adfe	random_failures: enable execute_lwt_transaction Fixes scylladb/scylladb#24502	2025-07-24 19:48:09 +02:00
Petr Gusev	84b74d6895	test_tablets_lwt: add test_paxos_state_table_permissions	2025-07-24 19:48:09 +02:00
Petr Gusev	c7cfba726d	test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft This test checks that LWT for tablets requires raft-based schema management.	2025-07-24 19:48:09 +02:00
Petr Gusev	529d2b949e	test_tablets_lwt: test timeout creating paxos state table	2025-07-24 19:48:09 +02:00
Petr Gusev	a9ef221ae8	test_tablets_lwt: add test_lwt_concurrent_base_table_recreation The test checks that we correctly handle the case when the base table is recreated during LWT execution.	2025-07-24 19:48:08 +02:00
Petr Gusev	e8e2419df6	test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild This test checks that the paxos state is preserved in case of tablet rebuild. This happens e.g. when a node is lost permanently and another node is started to replace it.	2025-07-24 19:48:08 +02:00
Petr Gusev	ff2c22ba6a	test_tablets_lwt: migrate test_lwt_support_with_tablets LWT is now supported for tablets, but this requires LWT_WITH_TABLETS feature. We migrate the test so that it checks the error messages in case the feature is not supported.	2025-07-24 19:48:08 +02:00
Petr Gusev	e0c4dc350c	test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration This test verifies that Paxos state is correctly migrated when the base table's tablet is migrated. This test fails if Paxos state is stored in system.paxos, as the final Paxos read would reflect conflicting outcomes from both prior LWT operations.	2025-07-24 19:48:08 +02:00
Petr Gusev	c11e1aef5c	test_tablets_lwt: add simple test for LWT We add/remove the base table several times to check that paxos state table is properly recreated.	2025-07-24 19:48:08 +02:00
Petr Gusev	78aa36b257	check_internal_table_permissions: handle Paxos state tables CDC and $paxos tables are managed internally by Scylla. Users are already prohibited from running ALTER and DROP commands on CDC tables. In this commit, we extend the same restrictions to $paxos tables to prevent users from shooting themselves in the foot. Other commands are generally allowed for CDC and $paxos tables. An important distinction is that CDC tables are meant to be accessed directly by users, so appropriate permissions must be set for non-superusers. In contrast, $paxos tables are not intended for direct access by users. Therefore, this commit explicitly disallows non-superusers from accessing them. Superusers are still allowed access for debugging and troubleshooting purposes. Note that these restrictions apply even if explicit permissions have been granted. For example, a non-superuser may be granted SELECT permissions on a $paxos table, but the restriction above will still take precedence. We don't try to restrict users from giving permissions to $paxos tables for simplicity.	2025-07-24 19:48:08 +02:00
Petr Gusev	ec3c5f4cbc	client_state: extract check_internal_table_permissions This is a refactoring commit — it extracts the CDC permissions handling logic into a separate function: check_internal_table_permissions. This is a preparatory step for the next commit, where we'll handle paxos state tables similarly to CDC tables.	2025-07-24 19:48:08 +02:00
Petr Gusev	bb4e7a669f	paxos_store: handle base table removal Subscribe to on_before_drop_column_family to drop the associated Paxos state table when the corresponding user table is dropped.	2025-07-24 19:48:08 +02:00
Petr Gusev	1b70623908	database: get_base_table_for_tablet_colocation: handle paxos state table We need to mark paxos state table as colocated with the user table, so that the corresponding tablets are migrated/repaired together.	2025-07-24 19:48:08 +02:00
Petr Gusev	03aa2e4823	paxos_state: use node_local_only mode to access paxos state	2025-07-24 19:48:08 +02:00
Petr Gusev	ff1caa9798	query_options: add node_local_only mode We want to access the paxos state table only on the local node and shard (or shards in case of intranode_migration). In this commit we add a node_local_only flag to query_options, which makes that possible. This flag can be set for a query via make_internal_options. We handle this flag on the statements layer by forwarding it to either coordinator_query_options or coordinator_mutate_options.	2025-07-24 19:48:08 +02:00
Petr Gusev	65c7e36b7c	storage_proxy: handle node_local_only in query In this commit we support node_local_only flag in read code path in storage_proxy.	2025-07-24 19:48:08 +02:00
Petr Gusev	2d747d97b8	storage_proxy: handle node_local_only in mutate We add the remove_non_local_host_ids() helper, which will be used in the next commit to support the read path. HostIdVector concept is introduced to be able to handle both host_id_vector_replica_set and host_id_vector_topology_change uniformly. The storage_proxy_coordinator_mutate_options class is declared outside of storage_proxy to avoid C++ compiler complaints about default field initializers. In particular, some storage_proxy methods use this class for optional parameters with default values, which is not allowed when the class is defined inside storage_proxy.	2025-07-24 19:48:08 +02:00
Petr Gusev	7eb198f2cc	storage_proxy: introduce node_local_only flag Add a per-request flag that restricts query execution to the local node by filtering out all non-local replicas. Standard consistency level (CL) rules still apply: if the local node alone cannot satisfy the requested CL, an exception is thrown. This flag is required for Paxos state access, where reads and writes must target only the local node. As a side effect, this also enables the implementation of scylladb/scylladb#16478, which proposes a CQL extension to expose 'local mode' query execution to users. Support for this flag in storage_proxy's read and write code paths will be added in follow-up commits.	2025-07-24 19:48:08 +02:00
Petr Gusev	8e745137de	abstract_replication_strategy: remove unused using	2025-07-24 19:48:08 +02:00
Petr Gusev	4c1aca3927	storage_proxy: add coordinator_mutate_options In upcoming commits, we want to add a node_local_only flag to both read and write paths in storage_proxy. This requires passing the flag from query_processor to the part of storage_proxy where replica selection decisions are made. For reads, it's sufficient to add the flag to the existing coordinator_query_options class. For writes, there is no such options container, so we introduce coordinator_mutate_options in this commit. In the future, we may move some of the many mutate() method arguments into this container to simplify the code.	2025-07-24 19:48:08 +02:00
Petr Gusev	b6ccaffd45	storage_proxy: rename create_write_response_handler -> make_write_response_handler Most of the create_write_response_handler overloads follow the same signature pattern to satisfy the sp::mutate_prepare call. The one which doesn't follow it is invoked by others and is responsible for creating a concrete handler instance. In this refactoring commit we rename it to make_write_response_handler to reduce confusion.	2025-07-24 19:48:08 +02:00
Petr Gusev	db946edd1d	storage_proxy: simplify mutate_prepare This is a refactoring commit. We remove extra lambda parameters from mutate_prepare since the CreateWriteHandler lambda can simply capture them. We can't std::move(permit) in another mutate_prepare overload, because each handler wants its own copy of this permit.	2025-07-24 19:48:08 +02:00
Petr Gusev	ac4bc3f816	paxos_state: lazily create paxos state table We call paxos_store::ensure_initialized in the beginning of storage_proxy::cas to create a paxos state table for a user table if it doesn't exist. When the LWT coordinator sends RPCs to replicas, some of them may not yet have the paxos schema. In paxos_store::get_paxos_state_schema we just wait for them to appear, or throw 'no_such_column_family' if the base table was dropped.	2025-07-24 19:48:08 +02:00
Dawid Mędrek	b559c1f0b6	db/hints/manager.cc: Add logs for changing host filter We add new logs when the host filter is undergoing a change. It should not happen very often and so it shouldn't clog the log files. At the same time, it provides us with useful information when debugging.	2025-07-24 17:45:34 +02:00
Dawid Mędrek	cb0cd44891	db/hints: Increase log level in critical functions We increase the log level in more important functions to capture more information about the behavior of hints. All of the promoted logs are printed rarely, so they should not clog the log files, but at the same time they provide more insight into what has already happened and what has not.	2025-07-24 17:41:54 +02:00
Petr Gusev	3e0347c614	migration_manager: add timeout to start_group0_operation and announce Pass a timeout parameter through to start_operation() and add_entry(), respectively. This is a preparatory change for the next commit, which will use the timeout to properly handle timeouts during lazy creation of Paxos state tables.	2025-07-24 16:39:50 +02:00
Petr Gusev	519f40a95e	paxos_store: use non-internal queries Switch paxos_store from using internal queries to regular prepared queries, so that prepared statements are correctly updated when the base table is recreated. The do_execute_cql_with_timeout function is extracted to reduce code bloat when execute_cql_with_timeout template function is instantiated. We change return type of execute_cql_with_timeout to untyped_result_set since shared_ptr is not really needed here.	2025-07-24 16:39:50 +02:00
Petr Gusev	6caa1ae649	qp: make make_internal_options public In upcoming commits, we will switch paxos_store from using internal queries to regular prepared queries, so that prepared statements are correctly updated when the base table is recreated. To support this, we want to reuse the logic for converting parameters from vector<data_value_or_unset> to raw_value_vector_with_unset. This commit makes make_internal_options public to enable that reuse.	2025-07-24 16:39:50 +02:00
Petr Gusev	13f7266052	paxos_store: conditional cf_id filter We want to reuse the same queries to access system.paxos and the co-located table. A separate co-located table will be created for each user table, so we won't need the cf_id filter for them. In this commit we make the cf_id filter optional and apply it only if the state table is actually system.paxos.	2025-07-24 16:39:50 +02:00
Petr Gusev	370f91adb7	paxos_store: coroutinize This is another preparatory step. We want to add more logic to paxos_store state access functions in the next commits, which is easier to do with coroutines. Pass ballot by value to delete_paxos_decision because paxos_state::prune is not a coroutine and the ballot parameter is destroyed when we return from it. The alternative solution -- pass by const reference to paxos_state::prune -- doesn't work because paxos_state::prune is called from a lambda in paxos_response_handler::prune, this lambda is not a coroutine and the 'ballot' field could be destroyed along with the body of this lambda as soon as we return from paxos_state::prune.	2025-07-24 16:39:50 +02:00
Petr Gusev	ab03badc15	feature_service: add LWT_WITH_TABLETS feature We will need this feature to determine if it's safe to enable LWTs for a tablet-based table.	2025-07-24 16:39:50 +02:00
Petr Gusev	8292ecf2e1	paxos_state: inline system_keyspace functions into paxos_store Prepares for reusing the same functions to access either system.paxos or a co-located table.	2025-07-24 16:39:50 +02:00
Petr Gusev	6e87a6cdb0	paxos_state: extract state access functions into paxos_store Introduce paxos_store abstraction to isolate Paxos state access. Prepares for supporting either system.paxos or a co-located table as the storage backend.	2025-07-24 16:39:50 +02:00
Gleb Natapov	d5e023bbad	topology coordinator: drop no longer needed token metadata barrier Currently we do a token metadata barrier before accepting a replacing node. It was needed for the "replace with the same IP" case to make sure old requests would not contact the new node by mistake. But now that we address nodes by ID, this is no longer possible, because old requests will use the old ID and will be rejected. Closes scylladb/scylladb#25047	2025-07-24 11:15:42 +02:00
Aleksandra Martyniuk	1767eb9529	repair: remove unused code	2025-07-24 11:11:12 +02:00
Aleksandra Martyniuk	a0031ad05e	api: repair_async: forbid repairing tablet keyspaces Return 403 Forbidden if a user tries to repair tablet keyspace with /storage_service/repair_async/ API.	2025-07-24 11:11:09 +02:00
Tomasz Grabiec	c9bf010d6d	Merge 'test.py: skip cleaning testlog' from Andrei Chekun Skip removing any artifacts when -s is provided between test.py invocations. Logs from the previous run will be overridden if tests are executed one more time. For example: 1. Execute tests A, B, C with parameter -s 2. All logs are present even if the tests passed 3. Execute test B with parameter -s 4. Logs for A and C are from the first run 5. Logs for B are from the most recent run Backport is not needed, since it is a framework enhancement. Closes scylladb/scylladb#24838 * github.com:scylladb/scylladb: test.py: skip cleaning artifacts when -s provided test.py: move deleting directory to prepare_dir	2025-07-24 09:46:42 +03:00
Gleb Natapov	ab6e328226	storage_proxy: preallocate write response handler hash table Currently it grows dynamically and triggers oversized allocation warning. Also it may be hard to find sufficient contiguous memory chunk after the system runs for a while. This patch pre-allocates enough memory for ~1M outstanding writes per shard. Fixes #24660 Fixes #24217 Closes scylladb/scylladb#25098	2025-07-24 09:46:42 +03:00
Patryk Jędrzejczak	f89ffe491a	Merge 'storage_service: cancel all write requests after stopping transports' from Sergey Zolotukhin When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore. If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out. This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped. Fixes scylladb/scylladb#23665 Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3. Closes scylladb/scylladb#24714 * https://github.com/scylladb/scylladb: storage_service: Cancel all write requests on storage_proxy shutdown test: Add test for unfinished writes during shutdown and topology change	2025-07-24 09:46:42 +03:00
Michał Chojnowski	0ca983ea91	utils/bit_cast: add object_representation() A util that casts a trivial object to the span of its bytes.	2025-07-23 17:03:05 +02:00
Patryk Jędrzejczak	f408d1fa4f	docs: document the option to set recovery_leader later In one of the previous commits, we made it possible to set `recovery_leader` on each node just before restarting it. Here, we update the corresponding documentation.	2025-07-23 15:36:57 +02:00
Patryk Jędrzejczak	9e45e1159b	test: delay setting recovery_leader in the recovery procedure tests In the previous commit, we made it possible to set `recovery_leader` on each node just before restarting it. Here, we change all the tests of the Raft-based recovery procedure to use and test this option.	2025-07-23 15:36:57 +02:00
Patryk Jędrzejczak	ba5b5c7d2f	gossip: add recovery_leader to gossip_digest_syn In the new Raft-based recovery procedure, live nodes join the new group 0 one by one during a rolling restart. There is a time window when some of them are in the old group 0, while others are in the new group 0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The current solution for this problem is to ignore group 0 mismatches if `recovery_leader` is set on the local node and to ask the administrator to perform the rolling restart in the following way: - set `recovery_leader` in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - proceed with the rolling restart. This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches when exactly one of the two gossiping nodes has `recovery_leader` set. We achieve this by adding `recovery_leader` to `gossip_digest_syn`. This change makes setting `recovery_leader` earlier on all nodes and reloading the config unnecessary. From now on, the administrator can simply restart each node with `recovery_leader` set. However, note that nodes that join group 0 must have `recovery_leader` set until all nodes join the new group 0. For example, assume that we are in the middle of the rolling restart and one of the nodes in the new group 0 crashes. It must be restarted with `recovery_leader` set, or else it would reject `gossip_digest_syn` messages from nodes in the old group 0. To avoid problems in such cases, we will continue to recommend setting `recovery_leader` in `scylla.yaml` instead of passing it as a command line argument.	2025-07-23 15:36:57 +02:00
Patryk Jędrzejczak	23f59483b6	db: system_keyspace: peers_table_read_fixup: remove rows with null host_id Currently, `peers_table_read_fixup` removes rows with no `host_id`, but not with null `host_id`. Null host IDs are known to appear in system tables, for example in `system.cluster_status` after a failed bootstrap. We better make sure we handle them properly if they ever appear in `system.peers`. This commit guarantees that null UUID cannot belong to `loaded_endpoints` in `storage_service::join_cluster`, which in particular ensures that we throw a runtime error when a user sets `recovery_leader` to null UUID during the recovery procedure. This is handled by the code verifying that `recovery_leader` belongs to `loaded_endpoints`.	2025-07-23 15:36:56 +02:00
Patryk Jędrzejczak	445a15ff45	db/config, gms/gossiper: change recovery_leader to UUID We change the type of the `recovery_leader` config parameter and `gossip_config::recovery_leader` from sstring to UUID. `recovery_leader` is supposed to store host ID, so UUID is a natural choice. After changing the type to UUID, if the user provides an incorrect UUID, parsing `recovery_leader` will fail early, but the start-up will continue. Outside the recovery procedure, `recovery_leader` will then be ignored. In the recovery procedure, the start-up will fail on: ``` throw std::runtime_error( "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. " "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader."); ```	2025-07-23 15:36:56 +02:00
Patryk Jędrzejczak	ec69028907	db/config, utils: allow using UUID as a config option We change the `recovery_leader` option to UUID in the following commit.	2025-07-23 15:36:45 +02:00
Gleb Natapov	ddc3b6dcf5	migration manager: assert that if schema pull is disabled the group0 is not in use_pre_raft_procedures state If schema pulls are disabled, group0 is used to bring the schema up to date by calling start_group0_operation(), which executes a raft read barrier internally, but if group0 is still in use_pre_raft_procedures, start_group0_operation() silently does nothing. Later, the code that assumes that the schema is already up to date will fail and print warnings into the log. But since getting queries in the state when a node is in raft enabled mode but group0 is still not configured is illegal, it is better to make those errors more visible by asserting them during testing. Closes scylladb/scylladb#25112	2025-07-23 14:10:17 +02:00
Petr Gusev	41a67510bb	Revert "main.cc: fix group0 shutdown order" This reverts commit `6b85ab79d6`.	2025-07-23 12:11:01 +02:00
Botond Dénes	b65a2e2303	Update seastar submodule * seastar 26badcb1...60b2e7da (42): > Revert "Fix incorrect defaults for io queue iops/bandwidth" > fair_queue: Ditch queue-wide accumulator reset on overflow > addr2line, scripts/stall-analyser: change the default tool to llvm-addr2line > Fix incorrect defaults for io queue iops/bandwidth > core/reactor: add cxx_exceptions() getter > gate: make destructor virtual > scripts/seastar-addr2line: change the default addr2line utility to llvm-addr2line > coding-style: Align example return types > reactor: Remove min_vruntime() declaration > reactor: Move enable_timer() method to private section > smp: fix missing span include > core: Don't keep internal errors counter on reactor > pollable_fd: Untangle shutdown() > io_queue: Remove deprecated statistics getters > fair_queue: Remove queued/executing resource counters > reactor: Move set_current_task() from public reactor API > util: make SEASTAR_ASSERT() failure generate SIGABRT > core: fix high CPU use at idle on high core count machines > Merge 'Move output IO throttler to IO queue level' from Pavel Emelyanov fair_queue: Move io_throttler to io_queue.hh fair_queue: Move metrics from to io_queue::stream fair_queue: Remove io_throttler from tests fair_queue_test: Remove io-throttler from fair-queue fair_queue: Remove capacity getters fair_queue: Move grab_result into io_queue::stream too fair_queue: Move throtting code to io_queue.cc fair_queue: Move throttling code to io_queue::stream class fair_queue: Open-code dispatch_requests() into users fair_queue: Split dispatch_requests() into top() and pop_front() fair_queue: Swap class push back and dispatch fair_queue: Configure forgiving factor externally fair_queue: Move replenisher kick to dispatch caller io_queue: Introduce io_queue::stream fair_queue: Merge two grab_capacity overloads fair_queue: Detatch outcoming capacity grabbing from main dispatch loop fair_queue: Move available tokens update into if branch io_queue: 
Rename make_fair_group_config into configure_throttler io_queue: Rename get_fair_group into get_throttler fair_queue: Rename fair_group -> io_throttler > http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values > Merge 'Relax reactor coupling with file_data_source_impl' from Pavel Emelyanov reactor: Relax friendship with file_data_source_impl fstream: Use direct io_stats reference > thread_pool: Relax coupling with reactor > reactor: Mark some IO classes management methods private > http: Deprecate json_exception > io_tester: Collect and report disk queue length samples > test/perf: Add context-switch measurer > http/client: Zero-copy forward content-length body into the underlying stream > json2code: Genrate move constructor and move-assignment operator > Merge 'Semi-mixed mode for output_stream' from Pavel Emelyanov output_stream: Support semi-mixed mode writing output_stream: Complete write(temporary_buffer) piggy-back-ing write(packet) iostream: Add friends for iostream tests packet: Mark bool cast operator const iostream: Document output_stream::write() methods > io_tester: Show metrics about requests split > reactor: add counter for internal errors > iotune: Print correct throughput units > core: add label to io_threaded_fallbacks to categorize operations > slab: correct allocation logic and enforce memory limits > Merge 'Fix for non-json http function_handlers' from Travis Downs httpd_test: add test for non-JSON function handler function_handlers: avoid implicit conversions http: do not always treat plain text reply as json > Merge 'tls: add ALPN support' from Łukasz Kurowski tls: add server-side ALPN support tls: add client-side ALPN support > Merge 'coroutine: experimental: generator: implement move and swap' from Benny Halevy coroutine: experimental: generator: implement move and swap coroutine: experimental: generator: unconstify buffer capacity > future: downgrade asserts > output_stream: Remove unused bits > Merge 'Upstream 
a couple of minor reactor optimizations' from Travis Downs Match type for pure_check_for_work Do not use std::function for check_for_work() > Handle ENOENT in getgrnam Includes scylla-gdb.py update by Pavel Emelyanov. Closes scylladb/scylladb#25094	2025-07-22 18:19:58 +02:00
Pavel Emelyanov	2df1945f2a	compaction: Pass "reason" to perform_task_on_all_files() This tells "cleanup", "rewrite" and "split" reasons from each other Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-22 18:53:10 +03:00
Pavel Emelyanov	08c8c03a20	compaction: Pass "reason" to run_with_compaction_disabled() This tells "cleanup" (done via try_perform_cleanup) and prepares the ground for more callers (see next patch) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-22 18:52:09 +03:00
Pavel Emelyanov	db46da45d2	compaction: Pass "reason" to stop_and_disable_compaction() This tells "truncate" operation from other reasons Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-22 18:51:16 +03:00
Sergey Zolotukhin	e0dc73f52a	storage_service: Cancel all write requests on storage_proxy shutdown During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown` as one of the first steps. However, even after RPCs are shut down, some write handlers in `storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM. Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block the messaging server shutdown and delay the entire shutdown process until the write timeout occurs. This change introduces the cancellation of all outstanding write handlers in `storage_proxy` during shutdown to prevent unnecessary delays. Fixes scylladb/scylladb#23665	2025-07-22 15:03:30 +02:00
Sergey Zolotukhin	bc934827bc	test: Add test for unfinished writes during shutdown and topology change This test reproduces an issue where a topology change and an ongoing write query during query coordinator shutdown can cause the node to get stuck. When a node receives a write request, it creates a write handler that holds a copy of the current table's ERM (Effective Replication Map). The ERM ensures that no topology or schema changes occur while the request is being processed. After the query coordinator receives the required number of replica write ACKs to satisfy the consistency level (CL), it sends a reply to the client. However, the write response handler remains alive until all replicas respond — the remaining writes are handled in the background. During shutdown, when all network connections are closed, these responses can no longer be received. As a result, the write response handler is only destroyed once the write timeout is reached. This becomes problematic because the ERM held by the handler blocks topology or schema change commands from executing. Since shutdown waits for these commands to complete, this can lead to unnecessary delays in node shutdown and restarts, and occasional test case failures. Test for: scylladb/scylladb#23665	2025-07-22 15:03:13 +02:00
Ran Regev	3d82b9485e	docs: update nodetool restore documentation for --sstables-file-list Fixes: #25128 A leftover from #25077 Closes scylladb/scylladb#25129	2025-07-22 14:43:35 +02:00
Yaron Kaikov	4445c11c69	./github/workflows/conflict_reminder: improve workflow with weekly notifications - Change schedule from twice weekly (Mon/Thu) to once weekly (Mon only) - Extend notification cooldown period from 3 days to 1 week - Prevent notification spam while maintaining immediate conflict detection on pushes Fixes: https://github.com/scylladb/scylladb/issues/25130 Closes scylladb/scylladb#25131	2025-07-22 15:21:12 +03:00
Benny Halevy	fce6c4b41d	tablets: prevent accidental copy of tablets_map As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:07:26 +03:00
Benny Halevy	dee0d7ffbf	locator: tablets: get rid of synchronous mutate_tablet_map It is currently used only by tests that could very well do with mutate_tablet_map_async. This will simplify the following patch to prevent accidental copy of the tablet_map, providing explicit clone/clone_gently methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:03:02 +03:00
Avi Kivity	e4c4141d97	test.py: don't crash on early cleanup of ScyllaServer If a test fails very early (still have to find why), test.py crashes while flushing a non-existent log_file, as shown below. To fix, initialize the property to None and check it during cleanup. ``` ================================================================================ [N/TOTAL] SUITE MODE RESULT TEST ------------------------------------------------------------------------------ 'ScyllaServer' object has no attribute 'log_file' test_cluster_features Traceback (most recent call last): File "/home/avi/scylla-maint/./test.py", line 816, in <module> sys.exit(asyncio.run(main())) ~~~~~~~~~~~^^^^^^^^ File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run return runner.run(main) ~~~~~~~~~~^^^^^^ File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^ File "/usr/lib64/python3.13/asyncio/base_events.py", line 725, in run_until_complete return future.result() ~~~~~~~~~~~~~^^ File "/home/avi/scylla-maint/./test.py", line 523, in main total_tests_pytest, failed_pytest_tests = await run_all_tests(signaled, options) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/avi/scylla-maint/./test.py", line 452, in run_all_tests failed += await reap(done, pending, signaled) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/avi/scylla-maint/./test.py", line 418, in reap result = coro.result() File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 143, in run return await super().run(test, options) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/avi/scylla-maint/test/pylib/suite/base.py", line 216, in run await test.run(options) File "/home/avi/scylla-maint/test/pylib/suite/topology.py", line 48, in run async with get_cluster_manager(self.uname, self.suite.clusters, str(self.suite.log_dir)) as manager: ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/usr/lib64/python3.13/contextlib.py", line 221, in __aexit__ await anext(self.gen) File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 2006, in get_cluster_manager await manager.stop() File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 1539, in stop await self.clusters.put(self.cluster, is_dirty=True) File "/home/avi/scylla-maint/test/pylib/pool.py", line 104, in put await self.destroy(obj) File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 65, in recycle_cluster srv.log_file.close() ^^^^^^^^^^^^ AttributeError: 'ScyllaServer' object has no attribute 'log_file' ``` Closes scylladb/scylladb#24885	2025-07-22 12:39:01 +02:00
Avi Kivity	2db2b42556	sstables: version: drop custom operator<=> The default comparison for enums is equivalent and sufficient. Closes scylladb/scylladb#24888	2025-07-22 12:39:01 +02:00
Avi Kivity	e89f6c5586	config, main: make cpu scheduling mandatory CPU scheduling has been with us since `641aaba12c` (2017), and no one ever disables it. Likely nothing really works without it. Make it mandatory and mark the option unused. Closes scylladb/scylladb#24894	2025-07-22 12:39:01 +02:00
Avi Kivity	ee138217ba	alternator: simplify std::views::transform calls that extract a member from a class Rather than calling std::views::transform with a lambda that extracts a member from a class, call std::views::transform with a pointer-to-member to do the same thing. This results in more concise code. Closes scylladb/scylladb#25012	2025-07-22 12:39:01 +02:00
Jakub Smolar	6e0a063ce3	gdb: handle zero-size reads in managed_bytes Fixes: https://github.com/scylladb/scylladb/issues/25048 Closes scylladb/scylladb#25050	2025-07-22 12:39:01 +02:00
Nadav Har'El	298a0ec4de	test/cqlpy: in README.md, remind users of run-cassandra to set NODETOOL test/cqlpy/README.md explains how to run the cqlpy tests against Cassandra, and mentions that if you don't have "nodetool" in your path you need to set the NODETOOL variable. However, when giving a simple example how to use the run-cassandra script, we forgot to remind the user to set NODETOOL in addition to CASSANDRA, causing confusion for users who didn't know why tests were failing. So this patch fixes the section in test/cqlpy/README.md with the run-cassandra example to also set the NODETOOL environment variable, not just CASSANDRA. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25051	2025-07-22 12:39:00 +02:00
Aleksandra Martyniuk	b5026edf49	tasks: change _finished_children type Parent task keeps a vector of statuses (task_essentials) of its finished children. When the number of children is large - for example because we have many tables and a child task is created for each table - we may hit an oversized allocation while adding a new child's essentials to the vector. Keep task_essentials of children in chunked_vector. Fixes: #25040. Closes scylladb/scylladb#25064	2025-07-22 12:39:00 +02:00
Pavel Emelyanov	d94be313c1	Merge 'test: audit: ignore cassandra user audit logs in AUTH tests' from Andrzej Jackowski Audit tests are vulnerable to noise from LOGIN queries (because AUTH audit logs can appear at any time). Most tests already use the `filter_out_noise` mechanism to remove this noise, but tests focused on AUTH verification did not, leading to sporadic failures. This change adds a filter to ignore AUTH logs generated by the default "cassandra" user, so tests only verify logs from the user created specifically for each test. Additionally, this PR: - Adds missing `nonlocal new_rows` statement that prevented some checks from being called - Adds a testcase for audit logs of `cassandra` user Fixes: https://github.com/scylladb/scylladb/issues/25069 Better backport those test changes to 2025.3. 2025.2 and earlier don't have `./cluster/dtest/audit_test.py`. Closes scylladb/scylladb#25111 * github.com:scylladb/scylladb: test: audit: add cassandra user test case test: audit: ignore cassandra user audit logs in AUTH tests test: audit: change names of `filter_out_noise` parameters test: audit: add missing `nonlocal new_rows` statement	2025-07-22 10:42:16 +03:00
Pavel Emelyanov	295165d8ea	Merge 's3_client: Enhance s3_client error handling' from Ernest Zaslavsky Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on seastar's side since it is going to break the integrity of data which may be downloaded more than once for the same range. Fixes: https://github.com/scylladb/scylladb/issues/25043 Should be backported to 2025.3 since we have an intention to release the native backup/restore feature Closes scylladb/scylladb#24883 * github.com:scylladb/scylladb: s3_client: Disable Seastar-level retries in HTTP client creation s3_test: Validate handling of non-`aws_error` exceptions s3_client: Improve error handling in chunked_download_source aws_error: Add factory method for `aws_error` from exception	2025-07-22 10:40:39 +03:00
Ran Regev	dd67d22825	nodetool restore: sstable list from a file Fixes: #25045 Added the ability to supply the list of files to restore from a given file. Mainly required for local testing. Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#25077	2025-07-22 09:11:02 +03:00
Pavel Emelyanov	52455f93b6	gms,init: Move get_disabled_features_from_db_config() from gms Now when all callers are decoupled from gms config generating code, the latter can be decoupled from the db::config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:20:17 +03:00
Pavel Emelyanov	8220974e76	code: Update callers generating feature service config Instead of requesting it from gms code, create it "by hand" with the help of get_disabled_features_from_db_config() method. This is how other services are configured by main/tools/testing code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:19:09 +03:00
Pavel Emelyanov	0808e65b4e	gms: Make feature_config a simple struct All config-s out there are plain structures without private members and methods used to simply carry the set of config values around. Make the feature service config alike. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:17:59 +03:00
Pavel Emelyanov	f703fb9b2d	gms: Split feature_config_from_db_config() into two The helper in question generates the disabled features set and assigns it on the config. This patch detaches the features set generation into another function. The former will go away eventually and the latter will be kept around main/test code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:16:40 +03:00
Ernest Zaslavsky	fc2c9dd290	s3_client: Disable Seastar-level retries in HTTP client creation Prevent Seastar from retrying HTTP requests to avoid buffer double-feed issues when an entire request is retried. This could cause data corruption in `chunked_download_source`. The change is global for every instance of `s3_client`, but it is still safe because: * Seastar's `http_client` resets connections regardless of retry behavior * `s3_client` retry logic handles all error types—exceptions, HTTP errors, and AWS-specific errors—via `http_retryable_client`	2025-07-21 17:03:23 +03:00
Ernest Zaslavsky	ba910b29ce	s3_test: Validate handling of non-`aws_error` exceptions Inject exceptions not wrapped in `aws_error` from request callback lambda to verify they are properly caught and handled.	2025-07-21 16:52:43 +03:00
Ernest Zaslavsky	b7ae6507cd	s3_client: Improve error handling in chunked_download_source Create aws_error from raised exceptions when possible and respond appropriately. Previously, non-aws_exception types leaked from the request handler and were treated as non-retryable, causing potential data corruption during download.	2025-07-21 16:49:47 +03:00
Ernest Zaslavsky	d53095d72f	aws_error: Add factory method for `aws_error` from exception Move `aws_error` creation logic out of `retryable_http_client` and into the `aws_error` class to support reuse across components.	2025-07-21 16:42:44 +03:00
Andrzej Jackowski	21aedeeafb	test: audit: add cassandra user test case Audit tests use the `filter_out_noise` function to remove noise from audit logs generated by user authentication. As a result, none of the existing tests covered audit logs for the default `cassandra` user. This change adds a test case for that user. Refs: scylladb/scylladb#25069	2025-07-21 14:54:20 +02:00
Andrzej Jackowski	aef6474537	test: audit: ignore cassandra user audit logs in AUTH tests Audit tests are vulnerable to noise from LOGIN queries (because AUTH audit logs can appear at any time). Most tests already use the `filter_out_noise` mechanism to remove this noise, but tests focused on AUTH verification did not, leading to sporadic failures. This change adds a filter to ignore AUTH logs generated by the default "cassandra" user, so tests only verify logs from the user created specifically for each test. Fixes: scylladb/scylladb#25069	2025-07-21 14:54:20 +02:00
Andrzej Jackowski	daf1c58e21	test: audit: change names of `filter_out_noise` parameters This is a refactoring commit that changes the names of the parameters of the `filter_out_noise` function, as well as names of related variables. The motivation for the change is the introduction of more complex filtering logic in the next commit of this patch series. Refs: scylladb/scylladb#25069	2025-07-21 14:54:01 +02:00
Andrzej Jackowski	e634a2cb4f	test: audit: add missing `nonlocal new_rows` statement The variable `new_rows` was not updated by the inner function `is_number_of_new_rows_correct` because the `nonlocal new_rows` statement was missing. As a result, `sorted_new_rows` was empty and certain checks were skipped. This change: - Introduces the missing `nonlocal new_rows` declaration - Adds an assertion verifying that the number of new rows matches the expected count - Fixes the incorrect variable name in the lambda used for row sorting	2025-07-21 14:53:48 +02:00
Pavel Emelyanov	339f08b24a	scripts: Enhance refresh_submodules.sh with nested summary Currently when refreshing submodule, the script puts a plain list of non-merge commits into commit message. The resulting summary contains everything, but is hard to understand. E.g. if updating seastar today the summary would start with * seastar 26badcb1...86c4893b (55): > util: make SEASTAR_ASSERT() failure generate SIGABRT > core: fix high CPU use at idle on high core count machines > http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values > reactor: Relax friendship with file_data_source_impl > fstream: Use direct io_stats reference > thread_pool: Relax coupling with reactor > reactor: Mark some IO classes management methods private > http: Deprecate json_exception > fair_queue: Move io_throttler to io_queue.hh > fair_queue: Move metrics from to io_queue::stream > fair_queue: Remove io_throttler from tests > fair_queue_test: Remove io-throttler from fair-queue > fair_queue: Remove capacity getters > fair_queue: Move grab_result into io_queue::stream too > fair_queue: Move throtting code to io_queue.cc > fair_queue: Move throttling code to io_queue::stream class > fair_queue: Open-code dispatch_requests() into users > fair_queue: Split dispatch_requests() into top() and pop_front() > fair_queue: Swap class push back and dispatch > fair_queue: Configure forgiving factor externally ... That's not very informative, because the update includes several large "merges" that have their summary which is missing here. 
This update changes the way summary is generated to include merges and their summaries and all merged commits are listed as sub-lines, like this * seastar 26badcb1...86c4893b (26): > util: make SEASTAR_ASSERT() failure generate SIGABRT > core: fix high CPU use at idle on high core count machines > Merge 'Move output IO throttler to IO queue level' from Pavel Emelyanov fair_queue: Move io_throttler to io_queue.hh fair_queue: Move metrics from to io_queue::stream fair_queue: Remove io_throttler from tests fair_queue_test: Remove io-throttler from fair-queue fair_queue: Remove capacity getters fair_queue: Move grab_result into io_queue::stream too fair_queue: Move throtting code to io_queue.cc fair_queue: Move throttling code to io_queue::stream class fair_queue: Open-code dispatch_requests() into users fair_queue: Split dispatch_requests() into top() and pop_front() fair_queue: Swap class push back and dispatch fair_queue: Configure forgiving factor externally fair_queue: Move replenisher kick to dispatch caller io_queue: Introduce io_queue::stream fair_queue: Merge two grab_capacity overloads fair_queue: Detatch outcoming capacity grabbing from main dispatch loop fair_queue: Move available tokens update into if branch io_queue: Rename make_fair_group_config into configure_throttler io_queue: Rename get_fair_group into get_throttler fair_queue: Rename fair_group -> io_throttler > http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values > Merge 'Relax reactor coupling with file_data_source_impl' from Pavel Emelyanov reactor: Relax friendship with file_data_source_impl fstream: Use direct io_stats reference > thread_pool: Relax coupling with reactor > reactor: Mark some IO classes management methods private ... Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24834	2025-07-21 14:48:30 +03:00
Ernest Zaslavsky	0053a4f24a	encryption: remove default case from component_type switch Do not use default, instead list all fall-through components explicitly, so if we add a new one, the developer doing that will be forced to consider what to do here. Eliminate the `default` case from the switch in `encryption_file_io_extension::wrap_sink`, and explicitly handle all `component_type` values within the switch statement. fixes: https://github.com/scylladb/scylladb/issues/23724 Closes scylladb/scylladb#24987	2025-07-21 14:43:12 +03:00
Ernest Zaslavsky	408aa289fe	treewide: Move misc files to `utils` directory As requested in #22114, moved the files and fixed other includes and build system. Moved files: - interval.hh - Map_difference.hh Fixes: #22114 This is a cleanup, no need to backport Closes scylladb/scylladb#25095	2025-07-21 11:56:40 +03:00
Piotr Dulikowski	7fd97e6a93	Merge 'cdc: Forbid altering columns of CDC log tables directly' from Dawid Mędrek The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 Backport: we should backport the change to all affected branches to prevent the consequences that may affect the user. Closes scylladb/scylladb#25008 * github.com:scylladb/scylladb: cdc: Forbid altering columns of inactive CDC log table cdc: Forbid altering columns of CDC log tables directly	2025-07-21 09:31:00 +02:00
Ran Regev	bb95ac857e	enable_set: fix separator formatting from space comma to comma space For better log readability. Fixes: #23883 Closes scylladb/scylladb#24647	2025-07-20 19:12:57 +03:00
Avi Kivity	3dfdcf7d7a	Merge 'transport: remove throwing `protocol_exception` on connection start' from Dario Mirovic `protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future. This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future. There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it. Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance. In order to see if there is a measurable difference, a modified version of the `test_protocol_version_mismatch` Python test is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test run has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.
Testing Build: `release` Test file: `test/cqlpy/test_protocol_exceptions.py` Test name: `test_protocol_version_mismatch` (modified for mass connection requests) Test arguments: ``` max_attempts=100'000 num_parallel=10 ``` Throwing `protocol_exception` results: ``` real=1:26.97 user=10:00.27 sys=2:34.55 cpu=867% real=1:26.95 user=9:57.10 sys=2:32.50 cpu=862% real=1:26.93 user=9:56.54 sys=2:35.59 cpu=865% real=1:26.96 user=9:54.95 sys=2:32.33 cpu=859% real=1:26.96 user=9:53.39 sys=2:33.58 cpu=859% real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862% # average ``` Returning `protocol_exception` as `result_with_exception` or an exceptional future: ``` real=1:18.46 user=9:12.21 sys=2:19.08 cpu=881% real=1:18.44 user=9:04.03 sys=2:17.91 cpu=869% real=1:18.47 user=9:12.94 sys=2:19.68 cpu=882% real=1:18.49 user=9:13.60 sys=2:19.88 cpu=883% real=1:18.48 user=9:11.76 sys=2:17.32 cpu=878% real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879% # average ``` This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567. Refs: #24567 This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting. Closes scylladb/scylladb#24738 * github.com:scylladb/scylladb: test/cqlpy: add cpp exception metric test conditions transport/server: replace protocol_exception throws with returns utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception transport/server: avoid exception-throw overhead in handle_error test/cqlpy: add protocol_exception tests	2025-07-20 17:42:30 +03:00
Dawid Mędrek	59800b1d66	cdc: Forbid altering columns of inactive CDC log table When CDC becomes disabled on the base table, the CDC log table still exists (cf. scylladb/scylladb@adda43edc7). If it continues to exist up to the point when CDC is re-enabled on the base table, no new log table will be created -- instead, the old log table will be re-attached. Since we want to avoid situations when the definition of the log table has become misaligned with the definition of the base table due to actions of the user, we forbid modifying the set of columns or renaming them in CDC log tables, even when they're inactive. Validation tests are provided.	2025-07-18 15:03:08 +02:00
Piotr Dulikowski	85e506dab5	Merge 'test.py: print warning when no tests found' from Andrei Chekun Quit from the repeats if the test is under the pytest runner directory and has some typos or is absent. This allows not going several times through the discovery and stopping execution. Print a warning at the end of the run when no tests were selected by provided name. Fixes: scylladb/scylladb#24892 Closes scylladb/scylladb#24918 * github.com:scylladb/scylladb: test.py: print warning in case no tests were found test.py: break the loop when there is no tests for pytest	2025-07-18 10:26:44 +02:00
Piotr Dulikowski	fd6e14f3ab	Merge 'cdc: throw error if column doesn't exist' from Michael Litvak In the CDC log transformer, when creating a CDC mutation based on some base table mutation, for each value of a base column we set the value in the CDC column with the same name. When looking up the column in the CDC schema by name, we may get a null pointer if a column by that name is not found. This shouldn't happen normally because the base schema and CDC schema should be compatible, and for each base column there should be a CDC column with the same name. However, there are scenarios where the base schema and CDC schema are incompatible for a short period of time when they are being altered. When a base column is being added or dropped, we could get a base mutation with this column set, and then the CDC transformer picks up the latest CDC schema which doesn't have this column. If such a thing happens, we fix the code to throw an exception instead of crashing on null pointer dereference. Currently we don't have a safer approach to handle this, but this might be changed in the future. The other alternative is dropping that data silently which we prefer not to do. Throwing an error is acceptable because this scenario most likely indicates this behavior by the user: * The user adds a new column, and starts writing values to the column before the ALTER is complete. or, * The user drops a column, and continues writing values to the column while it's being dropped. Both cases might as well fail with an error because the column is not found in the base table. Fixes scylladb/scylladb#24952 backport needed - simple fix for a node crash Closes scylladb/scylladb#24986 * github.com:scylladb/scylladb: test: cdc: add test_cdc_with_alter cdc: throw error if column doesn't exist	2025-07-18 09:40:56 +02:00
Dawid Mędrek	bea7c26d64	test/cqlpy/test_describe.py: Adjust test_create_role_with_hashed_password_authorization to work with Cassandra We adjust test_create_role_with_hashed_password_authorization to work with both Scylla and Cassandra. For some reason (probably a bug), Cassandra requires that the `LOGIN` property of a role come before the password.	2025-07-17 22:18:12 +02:00
Dawid Mędrek	55c22f864e	test/cqlpy/test_describe.py: Adjust test_desc_restore to work with Cassandra Cassandra doesn't use service levels, and it doesn't include auth in the output of `DESCRIBE SCHEMA`. It doesn't support the form of the statement `... WITH PASSWORDS`. UDFs in Cassandra don't support Lua. That's why the test didn't work against Cassandra. In this commit, we adjust it to work with both Scylla and Cassandra.	2025-07-17 22:17:15 +02:00
Dawid Mędrek	fca03ca915	test/cqlpy/test_describe.py: Mark Scylla-only tests as such Tests verifying that auth and service levels are part of the output of `DESCRIBE SCHEMA` were not marked as `scylla_only` when they were written, but they're a feature only Scylla has. Because of that, let's mark them with `scylla_only` so they're not run against Cassandra to avoid unnecessary failures. We also provide a short explanation for each test why it's marked that way.	2025-07-17 21:45:44 +02:00
Andrei Chekun	04b0fba88c	test.py: print warning in case no tests were found Print a warning at the end of the run when no tests were selected by provided name. Fixes: https://github.com/scylladb/scylladb/issues/24892	2025-07-17 19:51:22 +02:00
Michael Litvak	86dfa6324f	test: cdc: add test_cdc_with_alter Add a test that tests adding and dropping a column to a table with CDC enabled while writing to it.	2025-07-17 17:16:17 +02:00
Michael Litvak	b336f282ae	cdc: throw error if column doesn't exist In the CDC log transformer, when creating a CDC mutation based on some base table mutation, for each value of a base column we set the value in the CDC column with the same name. When looking up the column in the CDC schema by name, we may get a null pointer if a column by that name is not found. This shouldn't happen normally because the base schema and CDC schema should be compatible, and for each base column there should be a CDC column with the same name. However, there are scenarios where the base schema and CDC schema are incompatible for a short period of time when they are being altered. When a base column is being added or dropped, we could get a base mutation with this column set, and then the CDC transformer picks up the latest CDC schema which doesn't have this column. If such a thing happens, we fix the code to throw an exception instead of crashing on null pointer dereference. Currently we don't have a safer approach to handle this, but this might be changed in the future. The other alternative is dropping that data silently which we prefer not to do. Throwing an error is acceptable because this scenario most likely indicates this behavior by the user: * The user adds a new column, and starts writing values to the column before the ALTER is complete. or, * The user drops a column, and continues writing values to the column while it's being dropped. Both cases might as well fail with an error because the column is not found in the base table. Fixes scylladb/scylladb#24952	2025-07-17 17:16:17 +02:00
Dario Mirovic	4a6f71df68	test/cqlpy: add cpp exception metric test conditions Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions` metric is used. This is a global metric. To address potential test flakiness, each test runs multiple times: - `run_count = 100` - `cpp_exception_threshold = 10` If a change in the code introduced an exception, expectation is that the number of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs. In which case the test fails.	2025-07-17 17:02:48 +02:00
Dario Mirovic	5390f92afc	transport/server: replace protocol_exception throws with returns Replace throwing protocol_exception with returning it as a result or an exceptional future in the transport server module. This improves performance, for example during connection storms and server restarts, where protocol exceptions are more frequent. In functions already returning a future, protocol exceptions are propagated using an exceptional future. In functions not already returning a future, result_with_exception is used. Notable change is checking v.failed() before calling v.get() in process_request function, to avoid throwing in case of an exceptional future. Refs: #24567	2025-07-17 16:54:05 +02:00
Dario Mirovic	9f4344a435	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567	2025-07-17 16:40:02 +02:00
Dario Mirovic	30d424e0d3	transport/server: avoid exception-throw overhead in handle_error Previously, connection::handle_error always called f.get() inside a try/catch, forcing every failed future to throw and immediately catch an exception just to classify it. This change eliminates that extra throw/catch cycle by first checking f.failed(), getting the stored std::exception_ptr via f.get_exception(), and then dispatching on its type via utils::try_catch<T>(eptr). The error-response logic is not changed - cassandra_exception, std::exception, and unknown exceptions are caught and processed, and any exceptions thrown by write_response while handling those exceptions continue to escape handle_error. Refs: #24567	2025-07-17 16:40:02 +02:00
Dario Mirovic	7aaeed012e	test/cqlpy: add protocol_exception tests Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter from Scylla's metrics endpoint. These metrics are used to track protocol error count before and after each test. Add cql_with_protocol context manager utility for session creation with parameterized protocol_version value. This is used for testing connection establishment with different protocol versions, and proper disposal of successfully established sessions. The tests cover two failure scenarios: - Protocol version mismatch in test_protocol_version_mismatch which tests both supported and unsupported protocol version - Malformed frames via raw socket in _protocol_error_impl, used by several test functions, and also test_no_protocol_exceptions test to assert that the error counters never decrease during test execution, catching unintended metric resets Refs: #24567	2025-07-17 16:39:54 +02:00
Petr Gusev	2027856847	Revert "paxos_state: read repair for intranode_migration" This reverts commit `45f5efb9ba`. The load_and_repair_paxos_state function was introduced in scylladb/scylladb#24478, but it has never been tested or proven useful. One set of problems stems from its use of local data structures from a remote shard. In particular, system_keyspace and schema_ptr cannot be directly accessed from another shard — doing so is a bug. More importantly, load_paxos_state on different shards can't ever return different values. The actual shard from which data is read is determined by sharder.shard_for_reads, and storage_proxy will jump back to the appropriate shard if the current one doesn't match. This means load_and_repair_paxos_state can't observe paxos state from write-but-not-read shard, and therefore will never be able to repair anything. We believe this explicit Paxos state read-repair is not needed at all. Any paxos state read which drives some paxos round forward is already accompanied by a paxos state write. Suppose we wrote the state to the old shard but not to the new shard (because of some error) while streaming is already finished. The RPC call (prepare or accept) will return error to the coordinator, such replica response won't affect the current round. This write won't affect any subsequent paxos rounds either, unless in those rounds the write actually succeeds on both shards, effectively 'auto-repairing' paxos state. Same if we managed to write to the new shard but not to the old shard. Any subsequent reads will observe either the old state or the new state (if the tablet already switched reads to the new shard). In any case, we'll have to write the state to all relevant shards from sharder.shard_for_writes (one or two) before sending rpc response, making this state visible for all subsequent reads. 
Thus, the monotonicity property ("once observed, the state must always be observed") appears to hold without requiring explicit read-repair and load_and_repair_paxos_state is not needed. Closes scylladb/scylladb#24926	2025-07-17 14:00:43 +02:00
Botond Dénes	20693edb27	Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to). In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface. Later, we will add BTI indexes which will also implement this interface. This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`. Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes. No backports needed, this is a preparation for new functionality. Closes scylladb/scylladb#25000 * github.com:scylladb/scylladb: sstables: add sstable::make_index_reader() and use where appropriate sstables/mx: in readers, use abstract_index_reader instead of index_reader sstables: in validate(), use abstract_index_reader instead of index_reader where possible test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader sstables/index_reader: introduce abstract_index_reader sstables/index_reader: extract a prefetch_lower_bound() method	2025-07-17 14:32:08 +03:00
Nadav Har'El	04b263b51a	Merge 'vector_index: do not create a view when creating a vector index' from Michał Hudobski This PR adds a way for custom indexes to decide whether a view should be created for them, as for the vector_index the view is not needed, because we store it in the external service. To allow this, custom logic for describing indexes using custom classes was added (as it used to depend on the view corresponding to an index). Fixes: VECTOR-10 Closes scylladb/scylladb#24438 * github.com:scylladb/scylladb: custom_index: do not create view when creating a custom index custom_index: refactor describe for custom indexes custom_index: remove unneeded duplicate of a static string	2025-07-17 13:48:49 +03:00
Michał Chojnowski	4e4a4b6622	sstables: add sstable::make_index_reader() and use where appropriate If we add multiple index implementations, users of index readers won't easily know which concrete index reader type is the right one to construct. We also don't want pieces of code to depend on functionality specific to certain concrete types, if that's not necessary. So instead of constructing the readers by themselves, they can use a helper function, which will return an abstract (virtual) index reader. This patch adds such a function, as a method of `sstable`.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	1c4065e7dd	sstables/mx: in readers, use abstract_index_reader instead of index_reader This makes clear which methods of index_reader are available for use by sstable readers, and which aren't.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	efcf3f5d66	sstables: in validate(), use abstract_index_reader instead of index_reader where possible After we add a second index implementation, we will probably want to adjust validate() to work with either implementation. Some validations will be format-specific, but some will be common. For now, let's use abstract_index_reader for the validations which can be done through that interface, and let's have downcast-specific codepaths for the others. Note: we change a `get_data_file_position()` call to `data_file_positions().start`. The call happens at the beginning of a partition, and at this point these two expressions are supposed to be equivalent.	2025-07-17 10:32:57 +02:00
Michał Chojnowski	92219a5ef8	test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader We don't want tests to create the concrete `index_reader` directly. We would like them to be able to test both sstables which use `index_reader`, and those which will use the planned new index implementation. So we will let the tests construct an abstract_index_reader and pass it to the index_reader_assertions, which will be able to assert the requested properties on various implementations as it wants.	2025-07-17 10:32:56 +02:00
Michał Chojnowski	c052ccd081	sstables/index_reader: introduce abstract_index_reader We want to implement BTI indexes in Scylla. After we do that, some sstables will use a BTI index reader, while others will use the old BIG index reader. To handle that, we can expose a common virtual "index reader" interface to sstable readers. This is what this patch does. This interface can't be quite fully implemented by a BTI index, because some methods return keys which a BIG index stores, but a BTI index doesn't. So it will be further restricted in future patches. But for now, we only extract all methods currently used by the readers to a virtual interface.	2025-07-17 10:32:56 +02:00
Botond Dénes	fd6877c654	Merge 'alternator: avoid oversized allocation in Query/Scan' from Nadav Har'El This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches. Alternator's Scan and Query operations return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e., a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned.
When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535 The stalls caused by large allocations were seen by actual users, so it makes sense to backport this patch. On the other hand, the patch, while not big, is fairly intrusive (modifies the normal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport. Closes scylladb/scylladb#24480 * github.com:scylladb/scylladb: alternator: clean up by co-routinizing alternator: avoid spamming the log when failing to write response alternator: clean up and simplify request_return_type alternator: avoid oversized allocation in Query/Scan	2025-07-17 11:30:40 +03:00
Calle Wilund	5dd871861b	tests::proc::process_fixture: Fix line handler adaptor buffering Fixes #24998 Helper routine translating input_stream buffers to single lines did not loop over current buffer state, leading to only the first line being sent to end listener. Rewrote to use range iteration instead. Nicer. Closes scylladb/scylladb#24999	2025-07-17 10:58:03 +03:00
Ernest Zaslavsky	342e94261f	s3_client: parse multipart response XML defensively Ensure robust handling of XML responses when initiating multipart uploads. Check for the existence of required nodes before access, and throw an exception if the XML is empty or malformed. Refs: https://github.com/scylladb/scylladb/issues/24676 Closes scylladb/scylladb#24990	2025-07-17 10:55:04 +03:00
Botond Dénes	054ea54565	Merge 'streaming: Avoid deadlock by running view checks in a separate scheduling group' from Tomasz Grabiec This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. Even if we didn't deadlock, and the streaming semaphore was simply exhausted by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes #24807 Fixes #24925 Closes scylladb/scylladb#24929 * github.com:scylladb/scylladb: streaming: Avoid deadlock by running view checks in a separate scheduling group service: migration_manager: Run group0 barrier in gossip scheduling group	2025-07-17 10:24:41 +03:00
Botond Dénes	4c832d583e	Merge 'repair: Speed up ranges calculation when small table optimization is on' from Asias He repair: Speed up ranges calculation when small table optimization is on Normally, during bootstrap, in repair_service::bootstrap_with_repair, we need to calculate which range to sync data from carefully for the new node. With small table optimization on, we pass a single full range and all peer nodes to row level repair to sync data with. Now that we only need to pass a single range and full peers, there is no need to calculate the ranges and peers in repair_service::bootstrap_with_repair and drop it later. The calculation takes time which slows down bootstrap, e.g., ``` Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - bootstrap_with_repair: started with keyspace=system_distributed_everywhere, nr_ranges=23809 Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]: sync data for keyspace=system_distributed_everywhere, status=started, reason=bootstrap, small_table_optimization=true ``` The range calculation took 15 seconds for system_distributed_everywhere table. To fix, the ranges calculation is skipped if small table optimization is on for the keyspace. Before: cluster dev [ PASS ] cluster.test_boot_nodes.1 104.59s After: cluster dev [ PASS ] cluster.test_boot_nodes.1 89.23s A 15% improvement to bootstrap 30 node cluster was observed. Fixes #24817 Closes scylladb/scylladb#24901 * github.com:scylladb/scylladb: repair: Speed up ranges calculation when small table optimization is on test: Add test_boot_nodes.py	2025-07-17 10:23:45 +03:00
Nikos Dragazis	88554b7c7a	docs: Document the Azure Key Provider Extend the EaR ops guide to incorporate the new Azure Key Provider. Document its options and provide instructions on how to configure it. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 23:06:11 +03:00
Nikos Dragazis	09dcdebca3	test: Add tests for Azure Key Provider The tests cover a variety of scenarios, including: * Authentication with client secrets, client certificates, and IMDS. * Valid and invalid encryption options in the configuration and table schema. * Common error conditions such as insufficient permissions, non-existent keys and network errors. All tests run against a local mock server by default. A subset of the tests can also run against real Azure services if properly configured. The tests that support real Azure services were kept to a minimum to cover only the most basic scenarios (success path and common error conditions). Running the tests with real resources requires parameterizing them with env vars: * ENABLE_AZURE_TEST - set to non-zero (1/true) to run Azure tests (enabled by default) * ENABLE_AZURE_TEST_REAL - set to non-zero (1/true) to run against real Azure services * AZURE_TENANT_ID - the tenant where the principals live * AZURE_USER_1_CLIENT_ID - the client ID of user1 * AZURE_USER_1_CLIENT_SECRET - the secret of user1 * AZURE_USER_1_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user1 * AZURE_USER_2_CLIENT_ID - the client ID of user2 * AZURE_USER_2_CLIENT_SECRET - the secret of user2 * AZURE_USER_2_CLIENT_CERTIFICATE - the PEM-encoded certificate and private key of user2 * AZURE_KEY_NAME - set to <vault_name>/<keyname> User1 is assumed to have permissions to wrap/unwrap using the given key. User2 is assumed to not have permissions for these operations. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 23:06:01 +03:00
Nikos Dragazis	083aabe0c6	pylib: Add mock server for Azure Key Vault The Azure Key Provider depends on three Azure services: - Azure Key Vault - IMDS - Entra STS To enable local testing, introduce a mock server that offers all the needed APIs from these services. The server also offers an error injection endpoint to configure a particular service to respond with some error code for a number of consecutive requests. The server is integrated as a 3rd party service in test.py. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:09 +03:00
Nikos Dragazis	41b63469e1	encryption: Define and enable Azure Key Provider Define the Azure Key Provider to connect the core EaR business logic with the Azure-based Key Management implementation (Azure host). Introduce "AzureKeyProviderFactory" as a new `key_provider` value in the configuration. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:09 +03:00
Nikos Dragazis	f0927aac07	encryption: azure: Delegate hosts to shard 0 As in the AWS and GCP hosts, make all Azure hosts delegate their traffic to shard 0 to avoid creating too many data encryption keys and API calls to Key Vault. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:09 +03:00
Nikos Dragazis	339992539d	encryption: Add Azure host cache The encryption context maintains a cache per host type per thread. Add a cache for the Azure host as well. Initialize the cache with Azure hosts from the configuration, while registering the extensions for encryption. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:09 +03:00
Nikos Dragazis	c98d3246b2	encryption: Add config options for Azure hosts Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:09 +03:00
Nikos Dragazis	a1aef456ac	encryption: azure: Add override options Extend `get_or_create_key()` to accept host options that override the config options. This will be used to pass encryption options from the table schema. Currently, only the master key can be overridden. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:09 +03:00
Nikos Dragazis	5ba6ca0992	encryption: azure: Add retries for transient errors Inject a few fast retries to quickly recover from short-lived transient errors. If a request is unauthorized, retry with no delay, since it may be caused by expired tokens. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	d4dcdcd46c	encryption: azure: Implement init() Implement the `azure_host::init()` API that performs the async initialization of the host. Since the Azure host has no state that needs to be initialized, just verify that we have access to the Vault key. This will cause the system to fail earlier if not properly configured (e.g., the key does not exist, the credentials have insufficient permissions, etc.). Do not run any verification steps if no master key is configured in `scylla.yaml`. The master key can be specified later or overridden through the encryption options in table schema. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	1e519ba329	encryption: azure: Implement get_key_by_id() Implement the `azure_host::get_key_by_id()` API, which retrieves a data encryption key from a key ID. Use a loading cache to reduce the API calls to Key Vault. When the cache needs to refresh or reload a key, extract the ciphertext from the key ID and unwrap it with the Vault key that is also encoded in the key ID. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	7938096142	encryption: azure: Add id-based key cache Add a cache to store data encryption keys based on their IDs. This will be plugged into `get_key_by_id()` in a later patch to avoid unwrapping keys that have been encountered recently, thereby reducing the API calls to Key Vault. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	470513b433	encryption: azure: Implement get_or_create_key() Implement the `azure_host::get_or_create_key()` API, which returns a data encryption key for a given algorithm descriptor (cipher algorithm and key length). Use a loading cache to reduce the API calls to Key Vault. When the cache needs to refresh or reload a key, always create a new one and wrap it with the Vault key. For the REST API calls to Key Vault, use an ephemeral HTTP client and configure it to not wait for the server's response when terminating a TLS connection. Although the TLS protocol requires clients to wait on the server's response to a close_notify alert, the Key Vault service ignores this, causing the client to block for 10 seconds (hardcoded) before timing out. Use the following identifier for each key: <vault name>/<key name>/<key version>:<base64 encoded ciphertext of data encryption key> The key version is required to support Vault key rotations. Finally, define an exception for Vault errors. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
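The key identifier layout given in 470513b433 can be sketched in Python as follows. This is an illustration only: `AzureKeyId`, `format_key_id` and `parse_key_id` are hypothetical names, standard (not URL-safe) base64 is assumed, and vault and key names are assumed not to contain ':' or '/'.

```python
import base64
from typing import NamedTuple


class AzureKeyId(NamedTuple):
    vault: str
    key: str
    version: str
    ciphertext: bytes  # wrapped data encryption key


def format_key_id(vault: str, key: str, version: str, ciphertext: bytes) -> str:
    """Build '<vault name>/<key name>/<key version>:<base64 ciphertext>'."""
    encoded = base64.b64encode(ciphertext).decode("ascii")
    return f"{vault}/{key}/{version}:{encoded}"


def parse_key_id(key_id: str) -> AzureKeyId:
    # Split at the first ':'; the base64 alphabet never contains ':',
    # so the ciphertext part cannot be cut short.
    path, sep, encoded = key_id.partition(":")
    parts = path.split("/")
    if not sep or len(parts) != 3 or not all(parts):
        raise ValueError(f"malformed key id: {key_id!r}")
    return AzureKeyId(parts[0], parts[1], parts[2],
                      base64.b64decode(encoded, validate=True))
```

Because the key version is carried in the identifier, a key wrapped before a Vault key rotation can still be unwrapped with the version that produced it.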
Nikos Dragazis	e76187fb6d	encryption: azure: Add credentials in Azure host The Azure host needs credentials to communicate with Key Vault. First search for credentials in the host options, and then fall back to default credentials if the former are non-existent or incomplete. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	457c90056d	encryption: azure: Add attribute-based key cache Add a cache to store data encryption keys based on their attributes (cipher algorithm + key length). This will be plugged into `get_or_create_key()` in a later patch to reuse the same keys in multiple requests, thereby reducing the API calls to Key Vault. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	b39d1b195e	encryption: azure: Add skeleton for Azure host The Azure host manages cryptographic keys using Azure Key Vault. This patch only defines the API. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	e078abba57	encryption: Templatize get_{kmip,kms,gcp}_host() For deduplication. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	b1e719c531	encryption: gcp: Fix typo in docstring Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	eec49c4d78	utils: azure: Get access token with default credentials Attempt to detect credentials from the system. Inspired by the `DefaultAzureCredential` in the Azure C++ SDK, this credential type detects credentials from the following sources (in this order): * environment variables (SP credentials - same variables as in Azure C++ SDK) * Azure CLI * IMDS Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	937d6261c0	utils: azure: Get access token from Azure CLI Implement token request with Azure CLI. Inspired by the Azure C++ SDK's `AzureCliCredential`, this credential type attempts to run the Azure CLI in a shell and parse the token from its output. This is meant for development purposes, where a user has already installed the Azure CLI and logged in with their user account. Pass the following environment to the process: * PATH * HOME * AZURE_CONFIG_DIR Add a token factory to construct a token from the process output. Unlike in Azure Entra and IMDS, the CLI's JSON output does not contain 'expires_in', and the token key is in camel case. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	52a4bd83d5	utils: azure: Get access token from IMDS Implement token request from IMDS. No credentials are required for that - just a plain HTTP request on the IMDS token endpoint. Since the IMDS endpoint is a raw IP, it's not possible to reliably determine whether IMDS is accessible or not (i.e., whether the node is an Azure VM). Azure provides no node-local indication either. For lack of a better choice, attempt to connect and declare failure if the connection is not established within 3 seconds. Use a raw TCP socket for this check, as the HTTP client currently lacks timeout or cancellation support. Perform the check only once, during the first token refresh. For the time being, do not support nodes with multiple user-assigned managed identities. Expect the token request to fail in this case (IMDS requires the identifier of the desired Managed Identity). Add a token factory to correctly parse the HTTP response. This addresses a discrepancy between token requests on IMDS and Azure Entra - the 'expires_in' field is a string in the former and an integer in the latter. Finally, implement a fail-fast retry policy for short-lived transient errors. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
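The 'expires_in' discrepancy mentioned in 52a4bd83d5 can be shown with a small Python sketch of a token factory that accepts both forms. It is not the Scylla implementation: `parse_token_response` is a hypothetical name, and the `access_token` field name is assumed from the usual OAuth token response.

```python
import json


def parse_token_response(body: str) -> tuple[str, int]:
    """Return (access token, lifetime in seconds) from a token response."""
    doc = json.loads(body)
    token = doc["access_token"]
    # IMDS sends "expires_in" as a string, Entra as an integer;
    # int() normalizes both to a number of seconds.
    expires_in = int(doc["expires_in"])
    return token, expires_in
```

Both `{"expires_in": "3599"}` and `{"expires_in": 3599}` produce the same parsed lifetime.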
Nikos Dragazis	919765fb7f	utils: azure: Get access token with SP certificate Implement token request for Service Principals with a certificate. The request is the same as with a secret, except that the secret is replaced with an assertion. The assertion is a JWT that is signed with the certificate. To be consistent with the Azure C++ SDK, expect the certificate and the associated private key to be encoded in PEM format and be provided in a single file. The docs suggest using 'PS256' for the JWT's 'alg' claim. Since this is not supported by our current JWT library (jwt-cpp), use 'RS256' instead. The JWT also requires a unique identifier for the 'jti' claim. Use a random UUID for that (it should suffice for our use cases). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	a671530af6	utils: azure: Get access token with SP secret Implement token request for Service Principals with a secret. The token request requires a TLS connection. When closing the connection, do not wait for a response to the TLS `close_notify` alert. Azure's OAuth server would ignore it and the Seastar `connected_socket` would hang for 10 seconds. Add log redaction logic to not expose sensitive data from the request and response payloads. Add a token factory to parse the HTTP response. This cannot be shared with other credential types because the JSON format is not consistent. Finally, implement a fail-fast retry policy for short-lived transient errors. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	66c8ffa9bf	utils: rest: Add interface for request/response redaction logic The rest http client, currently used by the AWS and GCP key providers, logs the HTTP requests and responses unaltered. This causes some sensitive data to be exposed (plaintext data encryption keys, credentials, access tokens). Add an interface to optionally redact any sensitive data from HTTP headers and payloads. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	0d0135dc4c	utils: azure: Declare all Azure credential types The goal is to mimic the Azure C++ SDK, which offers a variety of credentials, depending on their type and source. Declare the following credentials: * Service Principal credentials * Managed Identity credentials * Azure CLI credentials * Default credentials Also, define a common exception for SP and MI credentials which are network-based. This patch only defines the API. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	3c4face47b	utils: azure: Define interface for Azure credentials Azure authentication is token based - the client obtains an access token with their credentials, and uses it as a bearer token to authorize requests to Azure services. Define a common API for all credential types. The API will consist of a single `get_access_token()` function that will be returning a new or a cached access token for some resource URI (defines token scope). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
Nikos Dragazis	57bc51342e	utils: Introduce base64url_{encode,decode} Add helpers for base64url encoding. base64url is a variant of base64 that uses a URL-safe alphabet. It can be constructed from base64 by replacing the '+' and '/' characters with '-' and '_' respectively. Many implementations also strip the padding, although this is not required by the spec [1]. This will be used in upcoming patches for Azure Key Vault requests that require base64url-encoded payloads. [1] https://datatracker.ietf.org/doc/html/rfc4648#section-5 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-16 17:14:08 +03:00
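The base64-to-base64url conversion described in 57bc51342e can be sketched in Python as below. The function names mirror the commit title, but the bodies are an illustration rather than the C++ helpers; the `strip_padding` parameter is an assumption added to show that padding removal is optional.

```python
import base64


def base64url_encode(data: bytes, strip_padding: bool = True) -> str:
    # Start from plain base64, then swap in the URL-safe alphabet.
    s = base64.b64encode(data).decode("ascii")
    s = s.replace("+", "-").replace("/", "_")
    return s.rstrip("=") if strip_padding else s


def base64url_decode(text: str) -> bytes:
    # Restore the standard alphabet and any padding that was stripped.
    s = text.replace("-", "+").replace("_", "/")
    s += "=" * (-len(s) % 4)
    return base64.b64decode(s)
```

For example, the bytes `fb ff` encode to `+/8=` in base64 and to `-_8` in unpadded base64url.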
Dawid Mędrek	20d0050f4e	cdc: Forbid altering columns of CDC log tables directly The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643	2025-07-16 15:35:48 +02:00
Patryk Jędrzejczak	a654101c40	Merge 'test.py: add missed parameters that should be passed from test.py to pytest' from Andrei Chekun Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling of these parameters: `--random-seed` and `--x-log2-compaction-groups` Since 2025.3 is affected by this issue and this is only a framework change, a backport to that version is needed. Fixes: https://github.com/scylladb/scylladb/issues/24927 Closes scylladb/scylladb#24928 * https://github.com/scylladb/scylladb: test.py: add bypassing x_log2_compaction_groups to boost tests test.py: add bypassing random seed to boost tests	2025-07-16 15:29:17 +02:00
Avi Kivity	c762425ea7	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. In addition, the existing `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. 
Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start	2025-07-16 13:15:54 +03:00
Asias He	6c49b7d0ce	repair: Speed up ranges calculation when small table optimization is on Normally, during bootstrap, in repair_service::bootstrap_with_repair, we need to calculate which range to sync data from carefully for the new node. With small table optimization on, we pass a single full range and all peer nodes to row level repair to sync data with. Now that we only need to pass a single range and full peers, there is no need to calculate the ranges and peers in repair_service::bootstrap_with_repair and drop it later. The calculation takes time which slows down bootstrap, e.g., ``` Jul 08 22:01:41.927785 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - bootstrap_with_repair: started with keyspace=system_distributed_everywhere, nr_ranges=23809 Jul 08 22:01:57.883797 cluster-scale-50-200-test-scayle-t-db-node-51209daa-93 scylla[5326]: [shard 0:strm] repair - repair[79eac1a1-5d5b-4028-ae1c-06e68bec2d50]: sync data for keyspace=system_distributed_everywhere, status=started, reason=bootstrap, small_table_optimization=true ``` The range calculation took 15 seconds for system_distributed_everywhere table. To fix, the ranges calculation is skipped if small table optimization is on for the keyspace. Before: cluster dev [ PASS ] cluster.test_boot_nodes.1 104.59s After: cluster dev [ PASS ] cluster.test_boot_nodes.1 89.23s A 15% improvement to bootstrap 30 node cluster was observed. Fixes #24817	2025-07-16 15:33:15 +08:00
Piotr Dulikowski	a14b7f71fe	auth: fix crash when migration code runs in parallel with raft upgrade The functions password_authenticator::start and standard_role_manager::start have a similar structure: they spawn a fiber which invokes a callback that performs some migration until that migration succeeds. Both handlers set a shared promise called _superuser_created_promise (those are actually two promises, one for the password authenticator and the other for the role manager). The handlers are similar in both cases. They check if auth is in legacy mode, and behave differently depending on that. If in legacy mode, the promise is set (if it was not set before), and some legacy migration actions follow. In auth-on-raft mode, the superuser is attempted to be created, and if it succeeds then the promise is _unconditionally_ set. While it makes sense at a glance to set the promise unconditionally, there is a non-obvious corner case during upgrade to topology on raft. During the upgrade, auth switches from the legacy mode to auth on raft mode. Thus, if the callback didn't succeed in legacy mode and then tries to run in auth-on-raft mode and succeeds, it will unconditionally set a promise that was already set - this is a bug and triggers an assertion in seastar. Fix the issue by surrounding the `shared_promise::set_value` call with an `if` - like it is already done for the legacy case. Fixes: scylladb/scylladb#24975 Closes scylladb/scylladb#24976	2025-07-16 10:22:48 +03:00
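The guard added in a14b7f71fe can be pictured with a Python analogy. Here `concurrent.futures.Future` stands in for the seastar `shared_promise`, and `mark_superuser_created` is a hypothetical helper; the real fix is in C++ and is not reproduced here. Setting an already-set future raises in Python, much as setting the promise twice triggers an assertion in seastar.

```python
from concurrent.futures import Future


def mark_superuser_created(promise: Future) -> None:
    """Resolve the promise once; later calls are harmless no-ops."""
    # Without this check, a second set_result() on an already-resolved
    # future raises InvalidStateError.
    if not promise.done():
        promise.set_result(None)
```

With the guard, the legacy path and the auth-on-raft path can both reach this code without the second caller failing.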
Michał Chojnowski	1e7a292ef4	sstables/index_reader: extract a prefetch_lower_bound() method The sstable reader reaches directly for a `clustered_index_cursor`. But a BTI index reader won't be able to implement `clustered_index_cursor`, because a BTI index doesn't store full clustering keys, only some trie-encoded prefixes. So we want to weaken the dependency. Instead of reaching for `clustered_index_cursor`, we add a method which expresses our intent, and we let `index_reader` touch the cursor internally.	2025-07-16 00:13:20 +02:00
Andrzej Jackowski	77a9b5919b	main: utils: add thread names to alien workers This commit adds a call to `pthread_setname_np` in `alien_worker::spawn`, so each alien worker thread receives a descriptive name. This makes debugging, monitoring, and performance analysis easier by allowing alien workers to be clearly identified in tools such as `perf`.	2025-07-15 23:29:21 +02:00
Andrzej Jackowski	9574513ec1	auth: move passwords::check call to alien thread Analysis of customer stalls showed that the `detail::hash_with_salt` function, called from `passwords::check`, often blocks the reactor. This function internally uses the `crypt_r` function from an external library to compute password hashes, which is a CPU-intensive operation. To prevent such reactor stalls, this commit moves the `passwords::check` call to a dedicated alien thread. This thread is created at system startup and is shared by all shards. Within the alien thread, an `std::mutex` synchronizes access between the thread and the shards. While this could theoretically cause frequent lock contentions, in practice, even during connection storms, the number of new connections per second per shard is limited (typically hundreds per second). Additionally, the `_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not too many connections are processed at once. Fixes scylladb/scylladb#24524	2025-07-15 23:29:13 +02:00
Andrzej Jackowski	4ac726a3ff	test: wait for 3 clients with given username in test_service_level_api test_service_level_api tests create a new session and wait for all clients to authenticate. However, the check that all connections are authenticated is done by verifying that there are no connections with the username 'anonymous', which is insufficient if new connections have not yet been listed. To avoid test failures, this commit introduces an additional check that verifies all expected clients are present in the system.clients table before proceeding with the test.	2025-07-15 23:28:39 +02:00
Andrzej Jackowski	8d398fa076	auth: refactor password checking in password_authenticator This commit splits an if statement into two ifs, to make it possible to call the `passwords::check` function from another (alien) thread in the next commit of this patch series. Ref. scylladb/scylladb#24524	2025-07-15 23:28:39 +02:00
Andrzej Jackowski	b3c6af3923	auth: make SHA-512 the only password hashing scheme for new passwords Before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt algorithms do not work because their scheme prefix lacks the required round count (e.g., it is `$2y$` instead of `$2y$05$`). We suspect this never worked as intended. Moreover, bcrypt tends to be slower than SHA-512, so we do not want to fix the prefix and start using it. - SHA-256 and SHA-512 are both part of the SHA-2 family, and libraries that support one almost always support the other. It is not expected to find a library that supports only SHA-256 but not SHA-512. - MD5 is not considered secure for password hashing. Therefore, this commit removes support for bcrypt_y, bcrypt_a, SHA-256, and MD5 for hashing new passwords to ensure that the correct hashing function (SHA-512) is used everywhere. This commit does not change the behavior of `passwords::check`, so it is still possible to use passwords hashed with the removed algorithms. Ref. scylladb/scylladb#24524	2025-07-15 23:28:33 +02:00
Andrzej Jackowski	62e976f9ba	auth: whitespace change in identify_best_supported_scheme() Remove tabs in `identify_best_supported_scheme()` to facilitate reuse of those lines after the for loop is removed. This change is motivated by the upcoming removal of support for obsolete password hashing schemes and removal of `identify_best_supported_scheme()` function. Ref. scylladb/scylladb#24524	2025-07-15 20:26:39 +02:00
Andrzej Jackowski	b20aa7b5eb	auth: require scheme as parameter for `generate_salt` This is a refactoring commit that changes the `generate_salt` function to require a password hashing scheme as a parameter. This change is motivated by the upcoming removal of support for obsolete password hashing schemes and removal of `identify_best_supported_scheme()` function. Ref. scylladb/scylladb#24524	2025-07-15 20:26:39 +02:00
Andrzej Jackowski	c4e6d9933d	auth: check password hashing scheme support on authenticator start This commit adds a check to the `password_authenticator` to ensure that at least one of the available password hashing schemes is supported by the current environment. It is better to fail at system startup rather than on the first attempt to use the password authenticator. This change is motivated by the upcoming removal of support for obsolete password hashing schemes and removal of `identify_best_supported_scheme()` function. Ref. scylladb/scylladb#24524	2025-07-15 20:26:33 +02:00
Botond Dénes	a26b6a3865	Merge 'storage: add `make_data_or_index_source` to the storages' from Ernest Zaslavsky Add `make_data_or_index_source` to the storages to utilize the new S3-based data source, which should improve restore performance * Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. * Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage`, which just creates a `data_source` from a file, and for `s3_storage`, which creates a (maybe) decrypting source from s3 make_download_source. This change should improve performance when reading large objects from S3 and should not affect anything for the `filesystem_storage` No backport needed since it enhances functionality which has not been released yet fixes: https://github.com/scylladb/scylladb/issues/22458 Closes scylladb/scylladb#23695 * github.com:scylladb/scylladb: sstables: Start using `make_data_or_index_source` in `sstable` sstables: refactor readers and sources to use coroutines sstables: coroutinize futurized readers sstables: add `make_data_or_index_source` to the `storage` encryption: refactor key retrieval encryption: add `encrypted_data_source` class	2025-07-15 13:32:13 +03:00
Andrei Chekun	a8fd38b92b	test.py: skip discovery when combined_test binary absent To discover which tests are included in combined_tests, pytest checks the binary at the very beginning. If the combined_tests binary is missing, discovery fails and no test runs, even one that is not included in combined_tests. This PR changes the behavior so that discovery does not fail when combined_tests is missing; it fails only if someone tries to run a test from it. Closes scylladb/scylladb#24761	2025-07-15 09:49:02 +02:00
Ernest Zaslavsky	8d49bb8af2	sstables: Start using `make_data_or_index_source` in `sstable` Convert all necessary methods to be awaitable. Start using `make_data_or_index_source` when creating data_source for data and index components. For proper working of compressed/checksummed input streams, start passing stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.	2025-07-15 10:10:23 +03:00
Ernest Zaslavsky	dff9a229a7	sstables: refactor readers and sources to use coroutines Refactor readers and sources to support coroutine usage in preparation for integration with `make_data_or_index_source`. Move coroutine-based member initialization out of constructors where applicable, and defer initialization until first use.	2025-07-15 10:10:23 +03:00
Pavel Emelyanov	4debe3af5d	scylla-gdb: Don't show io_queue executing and queued resources These counters are no longer accounted by io-queue code and are always zero. Even more -- accounting removal happened years ago and we don't have Scylla versions built with seastar older than that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24835	2025-07-15 07:41:20 +03:00
Botond Dénes	641a907b37	Merge 'test/alternator: clean up write isolation default and add more tests for the different modes' from Nadav Har'El In #24442 it was noticed that accidentally, for a year now, test.py and CI were running the Alternator functional tests (test/alternator) using one write isolation mode (`only_rmw_uses_lwt`) while the manual test/alternator/run used a different write isolation mode (`always_use_lwt`). There is no good reason for this discrepancy, so in the second patch of this 2-patch series we change test/alternator/run to use the write isolation mode that we've had in CI for the last year. But then, discussion on #24442 started: Instead of picking one mode or the other, don't we need test both modes? In fact, all four modes? The honest answer is that running all tests with all combinations of options is not practical - we'll find ourselves with an exponentially growing number of tests. What we really need to do is to run most tests that have nothing to do with write isolation modes on just one arbitrary write isolation mode like we're doing today. For example, numerous tests for the finer details of the ConditionExpression syntax will run on one mode. But then, have a separate test that verifies that one representative example of ConditionExpression (for example) works correctly on all four write isolation modes - rejected in forbid_rmw mode, allowed and behaves as expected on the other three. We had some tests like that in our test suite already, but the first patch in this series adds many more, making the test much more exhaustive and making it easier to review that we're really testing all four write isolation modes in every scenario that matters. Fixes #24442 No need to backport this patch - it's just adding more tests and changing developer-only test behavior. 
Closes scylladb/scylladb#24493 * github.com:scylladb/scylladb: test/alternator: make "run" script use only_rmw_uses_lwt test/alternator: improve tests for write isolation modes	2025-07-15 07:16:18 +03:00
Patryk Jędrzejczak	21edec1ace	test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when: - both writes succeeded with the same replica responding first, - one of the following reads succeeded with the other replica responding before it applied mutations from any of the writes. We fix the test by not expecting reads with CL=ONE to return a row. We also harden the test by inserting different rows for every pair (CL, coordinator), where one of the two coordinators is a normal node from DC1, and the other one is a zero-token node from DC2. This change makes sure that, for example, every write really inserts a row. Fixes scylladb/scylladb#22967 The fix addresses CI flakiness and only changes the test, so it should be backported. Closes scylladb/scylladb#23518	2025-07-15 07:14:09 +03:00
Botond Dénes	2d3965c76e	Merge 'Reduce Alternator table name length limit to 192 and fix crash when adding stream to table with very long name' from Nadav Har'El Before this series, it is possible to crash Scylla (due to an I/O error) by creating an Alternator table close to the maximum name length of 222, and then enabling Alternator Streams. This series fixes this bug in two ways: 1. On a pre-existing table whose name might be up to 222 characters, enabling Streams will check if the resulting name is too long, and if it is, fail with a clear error instead of crashing. This case will affect pre-existing tables whose name has between 207 and 222 characters (207 is `222 - strlen("_scylla_cdc_log")`) - for such tables enabling Streams will fail, but no longer crash. 2. For new tables, the table name length limit is lowered from 222 to 192. The new limit is still high enough, but ensures it will be possible to enable streams on any new table. It will also always be possible to add a GSI for such a table with name up to 29 characters (if the table name is shorter, the GSI name can be longer - the sum can be up to 221 characters). No need to backport, Alternator Streams is still an experimental feature and this patch just improves the unlikely situation of extremely long table names. Fixes #24598 Closes scylladb/scylladb#24717 * github.com:scylladb/scylladb: alternator: lower maximum table name length to 192 alternator: don't crash when adding Streams to long table name alternator: split length limit for regular and auxiliary tables alternator: avoid needlessly validating table name	2025-07-15 06:57:04 +03:00
Botond Dénes	26f135a55a	Merge 'Make KMIP host do nice TLS close on dropped connection + make PyKMIP test fixure not generate TLS noise + remove boost::process' from Calle Wilund Fixes #24873 If we, in KMIP host, release a connection (socket) because our connection pool for the host is full, we currently don't close the connection properly and only rely on destructors. This just makes sure `release` closes the connection if it neither retains nor caches it. Also, when running with the PyKMIP fixture, we tested the port being reachable using a normal socket. This makes python SSL generate errors -> log noise that looks like actual errors. Change the test setup to use a proper TLS connection + proper shutdown to avoid the noise logs. This also adds a fixture helper for processes, and moves EAR test to use it (and by extension, seastar::experimental::process) instead of boost::process, removing a nasty non-seastarish dependency. Closes scylladb/scylladb#24874 * github.com:scylladb/scylladb: encryption_test: Make PyKMIP run under seastar::experimental::process test/lib: Add wrapper helper for test process fixtures kmip_host: Close connections properly if dropped by pool being full encryption_at_rest_test: Do port check using TLS	2025-07-15 06:55:34 +03:00
Botond Dénes	1f9f43d267	Merge 'kms_host: Support external temporary security credentials' from Nikos Dragazis This PR extends the KMS host to support temporary AWS security credentials provided externally via the Scylla configuration file, environment variables, or the AWS credentials file. The KMS host already supports: * Temporary credentials obtained automatically from the EC2 instance metadata service or via IAM role assumption. * Long-term credentials provided externally via configuration, environment, or the AWS credentials file. This PR is about temporary credentials that are external, i.e., not generated by Scylla. Such credentials may be issued, for example, through identity federation (e.g., Okta + gimme-aws-creds). External temporary credentials are useful for short-lived tasks like local development, debugging corrupted SSTables with `scylla-sstable`, or other local testing scenarios. These credentials are temporary and cannot be refreshed automatically, so this method is not intended for production use. Documentation has been updated to mention these additional credential sources. Fixes #22470. New feature, no backport is needed. Closes scylladb/scylladb#22465 * github.com:scylladb/scylladb: doc: Expose new `aws_session_token` option for KMS hosts kms_host: Support authn with temporary security credentials encryption_config: Mention environment in credential sources for KMS	2025-07-15 06:45:39 +03:00
Jenkins Promoter	41bc6a8e86	Update pgo profiles - x86_64	2025-07-15 04:54:17 +03:00
Jenkins Promoter	b86674a922	Update pgo profiles - aarch64	2025-07-15 04:49:45 +03:00
Nadav Har'El	a248336e66	alternator: clean up by co-routinizing Reviewers of the previous patch complained about some ugly pre-existing code in alternator/executor.cc, where returning from an asynchronous (future) function requires lengthy verbose casts. So this patch cleans up a few instances of these ugly casts by using co_return instead of return. For example, the long and verbose return make_ready_future<executor::request_return_type>( rjson::print(std::move(response))); can be changed to the shorter and more readable co_return rjson::print(std::move(response)); This patch should not have any functional implications, and also not any performance implications: I only coroutinized slow-path functions and one function that was already "partially" coroutinized (and this was especially ugly and deserved being fixed). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:35 +03:00
Nadav Har'El	13ec94107a	alternator: avoid spamming the log when failing to write response Both make_streamed() and new make_streamed_with_extra_array() functions, used when returning a long response in Alternator, would write an error- level log message if it failed to write the response. This log message is probably not helpful, and may spam the log if the application causes repeated errors intentionally or accidentally. So drop these log messages. The exception is still thrown as usual. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:34 +03:00
Nadav Har'El	d8fab2a01a	alternator: clean up and simplify request_return_type The previous patch introduced a function make_streamed_with_extra_array which was a duplicate of the existing make_streamed. Reviewers complained how baroque the new function is (just like the old function), having to jump through hoops to return a copyable function working on non-copyable objects, making strange-named copies and shared pointers of everything. We needed to return a copyable function (std::function) just because Alternator used Seastar's json::json_return_type in the return type from executor function (request_return_type). This json_return_type contained either an sstring or an std::function, but neither was ever really appropriate: 1. We want to return noncopyable_function, not an std::function! 2. We want to return an std::string (which rjson::print() returns), not an sstring! So in this patch we stop using seastar::json::json_return_type entirely in Alternator. Alternator's request_return_type is now an std::variant of three types: 1. std::string for short responses, 2. noncopyable_function for long streamed responses, 3. api_error for errors. The ugliest parts of make_streamed() where we made copies and shared pointers to allow for a copyable function are all gone. Even nicer, a lot of other ugly relics of using seastar::json_return_type are gone: 1. We no longer need obscure classes and functions like make_jsonable() and json_string() to convert strings to response bodies - an operation can simply return a string directly - usually returning rjson::print(value) or a fixed string like "" and it just works. 2. There is no more usage of seastar::json in Alternator (except one minor use of seastar::json::formatter::to_json in streams.cc that can be removed later). Alternator uses RapidJSON for its JSON needs, we don't need to use random pieces from a different JSON library. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-14 18:41:34 +03:00
Nadav Har'El	2385fba4b6	alternator: avoid oversized allocation in Query/Scan This patch fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. 
Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535	2025-07-14 18:41:34 +03:00
Patryk Jędrzejczak	145a38bc2e	Merge 'raft: fix voter assignment of transitioning nodes' from Emil Maskovsky Previously, nodes would become voters immediately after joining, ensuring voter status was established before bootstrap completion. With the limited voters feature, voter assignment became deferred, creating a timing gap where nodes could finish bootstrapping without becoming voters. This timing issue could lead to quorum loss scenarios, particularly observed in tests but theoretically possible in production environments. This commit reorders voter assignment to occur before the `update_topology_state()` call, ensuring nodes achieve voter status before bootstrap operations are marked complete. This prevents the problematic timing gap while maintaining compatibility with limited voters functionality. If voter assignment succeeds but topology state update fails, the operation will raise an exception and be retried by the topology coordinator, maintaining system consistency. This commit also fixes issue where the `update_nodes` ignored leaving voters potentially exceeding the voter limit and having voters unaccounted for. Fixes: scylladb/scylladb#24420 No backport: Fix of a theoretical bug + CI stability improvement (we can backport eventually later if we see hits in branches) Closes scylladb/scylladb#24843 * https://github.com/scylladb/scylladb: raft: fix voter assignment of transitioning nodes raft: improve comments in group0 voter handler	2025-07-14 16:12:03 +02:00
Calle Wilund	722e2bce96	encryption_test: Make PyKMIP run under seastar::experimental::process Removes the requirement of boost::process, and all its non-seastar-ness. Hopefully also makes the IO and shutdown handling a bit more reliable.	2025-07-14 12:18:16 +00:00
Calle Wilund	253323bb64	test/lib: Add wrapper helper for test process fixtures Adds a wrapper for seastar::experimental::process, to help use external process fixtures in unit test. Mainly to share concepts such as line reading of stdout/err etc, and sync the shutdown of these. Also adds a small path searcher to find what you want to run.	2025-07-14 12:18:16 +00:00
Yaron Kaikov	fdcaa9a7e7	dist/common/scripts/scylla_sysconfig_setup: fix `SyntaxWarning: invalid escape sequence` There are invalid escape sequence warnings where raw strings should be used for the regex patterns Fixes: https://github.com/scylladb/scylladb/issues/24915 Closes scylladb/scylladb#24916	2025-07-14 11:20:41 +02:00
Benny Halevy	692b79bb7d	compaction: get_max_purgeable_timestamp: improve trace log messages Print the keyspace.table names, issue trace log messages also when returning early if tombstone_gc is disabled or when gc_check_only_compacting_sstables is set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24914	2025-07-14 11:16:58 +02:00
Calle Wilund	514fae8ced	kmip_host: Close connections properly if dropped by pool being full Fixes #24873 Note: this happens like never. But if we, in KMIP host, do release of a connection (socket) due to our connection pool for the host being full, we currently don't close the connection properly, only rely on destructors. While not very serious, this would lead to possible TLS errors in the KMIP host used, which should be avoided if possible. Fix is simple, just make release close the connection if it neither retains nor caches it.	2025-07-14 08:31:02 +00:00
Calle Wilund	0fe8836073	encryption_at_rest_test: Do port check using TLS If we connect using just a socket, and don't terminate connection nicely, we will get annoying errors in PyKMIP log. These distract from real errors. So avoid them.	2025-07-14 08:31:02 +00:00
Yaron Kaikov	ed7c7784e4	auto-backport.py: Avoid bot push to existing backport branches Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork. If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open. Fixes: https://github.com/scylladb/scylladb/issues/24953 Closes scylladb/scylladb#24954	2025-07-14 11:20:23 +03:00
Avi Kivity	6fce817aa8	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit(). Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Fixes https://github.com/scylladb/scylladb/issues/24531 Closes scylladb/scylladb#24886 [avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>] * github.com:scylladb/scylladb: test: add type creation to test_snapshot storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: 
introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-07-13 20:47:55 +03:00
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 
50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
Yaron Kaikov	66ff6ab6f9	packaging: add `ps` command to dependancies ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator. Fixes: #24827 Closes scylladb/scylladb#24830	2025-07-13 17:09:05 +03:00
Aleksandra Martyniuk	2ec54d4f1a	replica: hold compaction group gate during flush Destructor of database_sstable_write_monitor, which is created in table::try_flush_memtable_to_sstable, tries to get the compaction state of the processed compaction group. If at this point the compaction group is already stopped (and the compaction state is removed), e.g. due to concurrent tablet merge, an exception is thrown and a node coredumps. Add flush gate to compaction group to wait for flushes in compaction_group::stop. Hold the gate in seal function in table::make_memtable_list. seal function is turned into a coroutine to ensure it won't throw. Wait until async_gate is closed before flushing, to ensure that all data is written into sstables. Stop ongoing compactions beforehand. Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber. Stop method already flushes the compaction group. Fixes: #23911. Closes scylladb/scylladb#24582	2025-07-13 12:35:19 +03:00
Benny Halevy	0e455c0d45	utils: clear_gently: add support for sets Since set and unordered_set do not allow modifying their stored object in place, we need to first extract each object, clear it gently, and only then destroy it. To achieve that, introduce a new Extractable concept, that extracts all items in a loop and calls clear_gently on each extracted item, until the container is empty. Add respective unit tests for set and unordered_set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24608	2025-07-13 12:30:45 +03:00
Emil Maskovsky	f6bb5cb7a0	raft: fix voter assignment of transitioning nodes Previously, nodes would become voters immediately after joining, ensuring voter status was established before bootstrap completion. With the limited voters feature, voter assignment became deferred, creating a timing gap where nodes could finish bootstrapping without becoming voters. This timing issue could lead to quorum loss scenarios, particularly observed in tests but theoretically possible in production environments. This commit reorders voter assignment to occur before the `update_topology_state()` call, ensuring nodes achieve voter status before bootstrap operations are marked complete. This prevents the problematic timing gap while maintaining compatibility with limited voters functionality. If voter assignment succeeds but topology state update fails, the operation will raise an exception and be retried by the topology coordinator, maintaining system consistency. This commit also fixes issue where the `update_nodes` ignored leaving voters potentially exceeding the voter limit and having voters unaccounted for. Fixes: scylladb/scylladb#24420	2025-07-11 17:59:12 +02:00
Tomasz Grabiec	dff2b01237	streaming: Avoid deadlock by running view checks in a separate scheduling group This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes: #24807	2025-07-11 16:30:46 +02:00
Tomasz Grabiec	ee2fa58bd6	service: migration_manager: Run group0 barrier in gossip scheduling group Fixes two issues. One is potential priority inversion. The barrier will be executed using scheduling group of the first fiber which triggers it, the rest will block waiting on it. For example, CQL statements which need to sync the schema on replica side can block on the barrier triggered by streaming. That's undesirable. This is theoretical, not proved in the field. The second problem is blocking the error path. This barrier is called from the streaming error handling path. If the streaming concurrency semaphore is exhausted, and streaming fails due to timeout on obtaining the permit in check_needs_view_update_path(), the error path will block too because it will also attempt to obtain the permit as part of the group0 barrier. Running it in the gossip scheduling group prevents this. Fixes #24925	2025-07-11 16:29:31 +02:00
Andrei Chekun	f7c7877ba6	test.py: add bypassing x_log2_compaction_groups to boost tests Bypassing argument to pytest->boost that was missing.	2025-07-11 12:30:09 +02:00
Andrei Chekun	71b875c932	test.py: add bypassing random seed to boost tests Bypassing argument to pytest->boost that was missing. Fixes: https://github.com/scylladb/scylladb/issues/24927	2025-07-11 12:30:08 +02:00
Gleb Natapov	89f2edf308	api: unregister raft_topology_get_cmd_status on shutdown In `c8ce9d1c60` we introduced raft_topology_get_cmd_status REST api but the commit forgot to unregister the handler during shutdown. Fixes #24910 Closes scylladb/scylladb#24911	2025-07-10 17:16:44 +02:00
Andrei Chekun	64a095600b	test.py: break the loop when there is no tests for pytest Quit from the repeats if the test is under the pytest runner directory and has some typos or is absent. This allows not going several times through the discovery and stopping execution.	2025-07-10 15:09:28 +02:00
Piotr Dulikowski	d9aec89c4e	Merge 'vector_store_client: implement vector_store_client service' from Pawel Pery Vector Store service is an HTTP server which provides a vector search index and ANN (Approximate Nearest Neighbor) functionality. Vector Store retrieves metadata & data about indexes from Scylla using the CQL protocol & CDC functionality. Scylla will request ANN search using the HTTP API. Commits for the patch: - implement initial `vector_store_client` service. It also adds a parameter `vector_store_uri` to Scylla. - refactor sequential_producer as abortable - implement ip addr retrieval from dns. The URI for Vector Store must contain a DNS name; this commit implements IP address refreshing functionality - refactor primary_key as a top-level class. It is needed for the forward declaration of a primary_key - implement ANN API. It implements the core ANN search request functionality, adds the Vector Store HTTP API description in docs/protocols.md, and implements automatic boost tests with a mocked HTTP server for checking error conditions. New feature, should not be backported. Fixes: VECTOR-47 Fixes: VECTOR-45 -~- Closes scylladb/scylladb#24331 * github.com:scylladb/scylladb: vector_store_client: implement ANN API cql3: refactor primary_key as a top-level class vector_store_client: implement ip addr retrieval from dns utils: refactor sequential_producer as abortable vector_store_client: implement initial vector_store_client service	2025-07-10 13:18:20 +02:00
Marcin Maliszkiewicz	ace7d53cf8	test: add type creation to test_snapshot It covers the case when a new type and a new keyspace are created together.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	15b4db47c7	storage_service: always wake up load balancer on update tablet metadata Lack of wakeup is error-prone, as it relies on a wakeup occurring elsewhere.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	c62a022b43	db: schema_applier: call destroy also when exception occurs Otherwise objects may be destroyed on wrong shard, and assert will trigger in ~sharded().	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	b103fee5b6	db: replica: simplify seeding ERM during schema change We know that caller is running on shard 0 so we can avoid some extra boilerplate.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	44490ceb77	db: remove cleanup from add_column_family Since we abort now on failure during schema commit there is no need for cleanup as it only manages in-memory state. Explicit cf.stop was added to code paths outside of schema merging to avoid unnecessary regressions.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	317da13e90	db: abort on exception during schema commit phase As we have no way to recover from partial commit.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	81c3dabe06	db: make user defined types changes atomic The same order of creation/destruction is preserved as in the original code, looking from single shard point of view. create_types() is called on each shard separately, while in theory we should be able reuse results similarly as diff_rows(). But we don't introduce this optimization yet.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	e3f92328d3	replica: db: make keyspace schema changes atomic Now all keyspace related schema changes are observable on given shard as they would be applied atomically. This is achieved by commit_on_shard() function being non-preemptive (no futures, no co_awaits). In the future we'll extend this to the whole schema and also other subsystems.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	b18cc8145f	db: atomically apply changes to tables and views In this commit we make use of split functions introduced before. Pattern is as follows: - in merge_tables_and_views we call some preparatory functions - in schema_applier::update we call non-yielding step - in schema_applier::post_commit we call cleanups and other finalizing async functions Additionally we introduce frozen_schema_diff because converting schema_ptr to global_schema_ptr triggers schema registration and with atomic changes we need to place registration only in commit phase. Schema freezing is the same method global_schema_ptr uses to transport schema across shards (via schema_registry cache).	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	19bc6ffcb0	replica: make truncate_table_on_all_shards get whole schema from table_shards Before for views and indexes it was fetching base schema from db (and couple other properties). This is a problem once we introduce atomic tables and views deletion (in the following commit). Because once we delete table it can no longer be fetched from db object, and truncation is performed after atomically deleting all relevant tables/views/indexes. Now the whole relevant schema will be fetched via global_table_ptr (table_shards) object.	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	5ad1845bd6	service: split update_tablet_metadata into two phases In following commits calls will be split in schema_applier.	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	2f840e51d1	service: pull out update_tablet_metadata from migration_listener It's not a good usage as there is only one non-empty implementation. Also we need to change it further in the following commit which makes it incompatible with listener code.	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	fa157e7e46	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	847d7f4a3a	service: simplify load_tablet_metadata and update_tablet_metadata - remove load_tablet_metadata(), instead we add wake_up_load_balancer flag to update_tablet_metadata(), it reduces number of public functions and also serves as a comment (removed comment with very similar meaning) - reimplement the code to not use mutate_token_metadata(), this way it's more readable and it's also needed as we'll split update_tablet_metadata() in following commits so that we can have subroutine which doesn't yield (for ensuring atomicity)	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	e242ae7ee8	db: don't perform move on tablet_hint reference This lambda is called several times so there should be no move. Currently the bug likely doesn't manifest as code does work only on shard 0.	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	1c5ec877a7	replica: split add_column_family_and_make_directory into steps This is similar work as for drop_table in previous commit. add_column_family_and_make_directory() behaves exactly the same as before but calls to it in schema_applier will be replaced by calls directly to split steps. Other usages will remain intact as they don't need atomicity (like creating system tables at startup).	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	c2cd02272a	replica: db: split drop_table into steps This is done so that actual dropping can be an atomic step which could be composed with other schema operations, and eventually all subsystems modified via raft so that we could introduce atomic changes which span across different subsystems. We split drop_table_on_all_shards() into: - prepare_tables_metadata_change_on_all_shards() - prepare_drop_table_on_all_shards() - drop_table() - cleanup_drop_table_on_all_shards() prepare_tables_metadata_change_on_all_shards() is necessary because when applying multiple schema changes at once (e.g. drop and add tables) we need to lock only once. We add legacy_drop_table_on_all_shards() which behaves exactly like old drop_table_on_all_shards() to be compatible with code which doesn't need to play with atomicity. Usages of legacy_drop_table_on_all_shards() in schema_applier will be replaced with direct calls to split functions in the following commits - that's the place we will take advantage of drop_table not yielding (as it returns void now).	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	d00266ac49	db: don't move map references in merge_tables_and_views() Since they are const it's not needed and misleading.	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	fdaff143be	db: introduce commit_on_shard function This will be the place for all atomic schema switching operations. Note that atomicity is observed only from single shard point of view. All shards may switch at slightly different times as global locking for this is not feasible.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	2e69016c4f	db: access types during schema merge via special storage Once we create types atomically the code which is before commit may depend on newly added types, so it has to access both old and new types. New storage called in_progress_types_storage was added.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	71bd452075	replica: make non-preemptive keyspace create/update/delete functions public As those operations will be managed by schema_applier class. This will be implemented in following commit.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	dce0e65213	replica: split update keyspace into two phases - first phase is preemptive (prepare_update_keyspace) - second phase is non-preemptive (update_keyspace) This is done so that schema change can be applied atomically. Additionally, the create keyspace code was changed to share common part with update keyspace flow. This commit doesn't yet change the behaviour of the code, as it doesn't guarantee atomicity, it will be done in following commits.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	734f79e2ad	replica: split creating keyspace into two functions This is done so that in following commits insert_keyspace can be used to atomically change schema (as it doesn't yield).	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	ec270b0b5e	db: rename create_keyspace_from_schema_partition It only creates keyspace metadata.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	9c856b5785	db: decouple functions and aggregates schema change notification from merging code	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	32b2786728	db: store functions and aggregates change batch in schema_applier To be used in following commit.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	bc2d028f77	db: decouple tables and views schema change notifications from merging code As post_commit() can't be fully implemented at this stage, it was moved to interim place to keep things working. It will be moved back later.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	af5e0d7532	db: store tables and views schema diff in schema_applier It will be used in subsequent commit for moving notifications code.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	9c8f3216ab	db: decouple user type schema change notifications from types merging code Merging types code now returns generic affected_types structure which is used both for notifications and dropping types. New static function drop_types() replaces dropping lambda used before. While I think it's not necessary for dropping nor notifications to use per shard copies (like it's using before and after this patch) it could just use string parameters or something similar but this requires too many changes in other classes so it's out of scope here.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	ae81497995	service: unify keyspace notification functions arguments Keyspace metadata is not used, only name is needed so we can remove those extra find_keyspace() calls. Moreover there is no need to copy the name.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	45c5c44c2d	db: replica: decouple keyspace schema change notifications to a separate function In following commits we want to separate updating code from committing schema change (making it visible). Since notifications should be issued after change is visible we need to separate them and call after committing. In subsequent commits other notification types will be moved too. We change here order of notification calls with regards to rest of schema updating code. I.e. before keyspace notifications triggered before tables were updated, after the change they will trigger once everything is updated. There is no indication that notification listeners depend on this behaviour.	2025-07-10 10:40:42 +02:00
Marcin Maliszkiewicz	96332964b7	db: add class encapsulating schema merging This commit doesn't yet change how schema merging works but it prepares the ground for it. We split merging code into several functions. Main reasons for it are that: - We want to generalize and create some interface which each subsystem would use. - We need to pull mutation's apply() out of the code because raft will call it directly, and it will contain a mix of mutations from more than one subsystem. This is needed because we have the need to update multiple subsystems atomically (e.g. auth and schema during auto-grant when creating a table). In this commit do_merge_schema() code is split between prepare(), update(), commit(), post_commit(). The idea behind each of these phases is described in the comments. The last 2 phases are not yet implemented as it requires more code changes but adding schema_applier enclosing class will help to create some copied state in the future and implement commit() and post_commit() phases.	2025-07-10 10:40:42 +02:00
Asias He	ccce5f2472	test: Add test_boot_nodes.py A simple add node test which can be used to test adding a large number of nodes to a cluster.	2025-07-10 10:56:53 +08:00
Andrei Chekun	e34569bd92	test.py: handle max failures for pytest repeats Pytest can handle max failures, but inside one run, and it was not affecting the repeats. Repeats for pytest is just another execution of the process, so there is no connection between them. With additional check, it will respect max fails. Closes scylladb/scylladb#24760	2025-07-09 19:57:58 +02:00
Michael Litvak	fa24fd7cc3	tablets: stop storage group on deallocation When a tablet transitions to a post-cleanup stage on the leaving replica we deallocate its storage group. Before the storage can be deallocated and destroyed, we must make sure it's cleaned up and stopped properly. Normally this happens during the tablet cleanup stage, when table::cleanup_table is called, so by the time we transition to the next stage the storage group is already stopped. However, it's possible that tablet cleanup did not run in some scenario: 1. The topology coordinator runs tablet cleanup on the leaving replica. 2. The leaving replica is restarted. 3. When the leaving replica starts, still in `cleanup` stage, it allocates a storage group for the tablet. 4. The topology coordinator moves to the next stage. 5. The leaving replica deallocates the storage group, but it was not stopped. To address this scenario, we always stop the storage group when deallocating it. Usually it will be already stopped and complete immediately, and otherwise it will be stopped in the background. Fixes scylladb/scylladb#24857 Fixes scylladb/scylladb#24828 Closes scylladb/scylladb#24896	2025-07-09 19:29:14 +03:00
Aleksandra Martyniuk	17272c2f3b	repair: Reduce max row buf size when small table optimization is on If small_table_optimization is on, a repair works on a whole table simultaneously. It may be distributed across the whole cluster and all nodes might participate in repair. On a repair master, row buffer is copied for each repair peer. This means that the memory scales with the number of peers. In large clusters, repair with small_table_optimization leads to OOM. Divide the max_row_buf_size by the number of repair peers if small_table_optimization is on. Use max_row_buf_size to calculate number of units taken from mem_sem. Fixes: https://github.com/scylladb/scylladb/issues/22244. Closes scylladb/scylladb#24868	2025-07-09 16:55:38 +03:00
Avi Kivity	0138afa63b	service: tablet_allocator: avoid large contiguous vector in make_repair_plan() make_repair_plan() allocates a temporary vector which can grow larger than our 128k basic allocation unit. Use a chunked vector to avoid stalls due to large allocations. Fixes #24713. Closes scylladb/scylladb#24801	2025-07-09 12:50:02 +02:00
Pawel Pery	eadbf69d6f	vector_store_client: implement ANN API This patch is a part of vector_store_client sharded service implementation for a communication with vector-store service. It implements a functionality for ANN search request to a vector-store service. It sends the request, receives the response and, after parsing it, returns the list of primary keys. It adds json parsing functionality specific for the HTTP ANN API. It adds a hardcoded http request timeout for retrieving response from the Vector Store service. It also adds an automatic boost test of the ANN search interface, which uses a mockup http server in a background to simulate vector-store service. It adds a documentation for HTTP API protocol used for ANN functionality. Fixes: VS-47	2025-07-09 11:54:51 +02:00
Pawel Pery	5bfce5290e	cql3: refactor primary_key as a top-level class This patch is a part of vector_store_client sharded service implementation for a communication with vector-store service. There is a need for forward declaration of primary_key class. This patch moves a nested definition of select_statement::primary_key (from a cql3::statements namespace) into a standalone class in a cql3::statements namespace. Reference: VS-47	2025-07-09 11:54:51 +02:00
Pawel Pery	1f797e2fcd	vector_store_client: implement ip addr retrieval from dns This patch is a part of vector_store_client sharded service implementation for a communication with vector-store service. It implements functionality for refreshing ip address of the vector-store service dns name and creating a new HTTP client with that address. It also provides cleanup of unused http clients. There are hardcoded intervals for dns refresh and old http clients cleanup, and timeout for requesting new http client. This patch introduces two background tasks - for dns resolving task and for cleanup old http clients. It adds unit tests for possible dns refreshing issues. Reference: VS-47 Fixes: VS-45	2025-07-09 11:54:51 +02:00
Emil Maskovsky	df37c514d3	raft: improve comments in group0 voter handler Enhance code documentation in the group0 voter handler implementation.	2025-07-09 10:40:59 +02:00
Pawel Pery	8d3c33f74a	utils: refactor sequential_producer as abortable This patch is a part of vector_store_client sharded service implementation for a communication with vector-store service. There is a need for abortable sequential_producer operator(). The existing operator() is changed to allow timeout argument with default time_point::max() (as current default usage) and the new operator() is created with abort_source parameter. Reference: VS-47	2025-07-08 16:29:55 +02:00
Pawel Pery	7bf53fc908	vector_store_client: implement initial vector_store_client service This patch is a part of vector_store_client sharded service implementation for a communication with vector-store service. It adds a `services/vector_store_client.{cc|hh}` sharded service and a configuration parameter `vector_store_uri` with a `http://vector-store.dns.name:port` format. If parsing that parameter fails, an exception is thrown during construction. For the future unit testing purposes the patch adds `vector_store_client_tester` as a way to inject mockup functionality. This service will be used by the select statements for the Vector search indexes (see VS-46). For this reason I've added vector_store_client service in the query processor. Reference: VS-47 VS-45	2025-07-08 16:29:55 +02:00
Yaniv Michael Kaul	82fba6b7c0	PowerPC: remove ppc stuff We don't even compile-test it. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#24659	2025-07-08 10:38:23 +03:00
Piotr Dulikowski	6c65f72031	Merge 'batchlog_manager: abort replay of a failed batch on shutdown or node down' from Michael Litvak When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason. Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation. backport to relevant versions since the issue can cause node shutdown to hang Fixes scylladb/scylladb#24599 Closes scylladb/scylladb#24595 * github.com:scylladb/scylladb: test: test_batchlog_manager: batchlog replay includes cdc test: test_batchlog_manager: test batch replay when a node is down batchlog_manager: set timeout on writes batchlog_manager: abort writes on shutdown batchlog_manager: create cancellable write response handler storage_proxy: add write type parameter to mutate_internal	2025-07-07 16:48:07 +02:00
Andrei Chekun	ae6dc46046	test.py: skip cleaning artifacts when -s provided Skip removing any artifacts when -s provided between test.py invocation. Logs from the previous run will be overridden if tests were executed one more time. For example: 1. Execute tests A, B, C with parameter -s 2. All logs are present even if tests are passed 3. Execute test B with parameter -s 4. Logs for A and C are from the first run 5. Logs for B are from the most recent run	2025-07-07 15:42:11 +02:00
Patryk Jędrzejczak	2a52834b7f	Merge 'Make it easier to debug stuck raft topology operation.' from Gleb Natapov The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations. Backport since we want to have it in production as quickly as possible. Fixes #24860 Closes scylladb/scylladb#24799 * https://github.com/scylladb/scylladb: topology coordinator: log a start and an end of topology coordinator command execution at info level topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc	2025-07-07 15:40:44 +02:00
Michał Hudobski	919cca576f	custom_index: do not create view when creating a custom index Currently we create a view for every index, however for currently supported custom index classes (vector_index) that work is redundant, as we store the index in the external service. This patch adds a way for custom indexes to choose whether to create a view when creating the index and makes it so that for vector indexes the view is not created.	2025-07-07 13:47:07 +02:00
Michał Hudobski	d4002b61dd	custom_index: refactor describe for custom indexes Currently, to describe an index we look at a corresponding view. However for custom indexes the view may not exist (as we are removing the views from vector indexes). This commit adds a way for a custom index class to override the default describing logic and provides such an override for the vector_index class.	2025-07-07 13:47:07 +02:00
Michał Hudobski	5de3adb536	custom_index: remove unneeded duplicate of a static string We have got a duplicate of the same static string and the only usage of one of the copies can be easily replaced	2025-07-07 13:47:06 +02:00
Piotr Dulikowski	ea35302617	Merge 'test: audit: enable syslog audit tests' from Andrzej Jackowski Several audit test issues caused test failures, and in the result, almost all of audit syslog tests were marked with xfail. This patch series enables the syslog audit tests, that should finally pass after the following fixes are introduced: - bring back commas to audit syslog (scylladb#24410 fix) - synchronize audit syslog server - fix parsing of syslog messages - generate unique uuid for each line in syslog audit - allow audit logging from multiple nodes Fixes: scylladb/scylladb#24410 Test improvements, no backport required. Closes scylladb/scylladb#24553 * github.com:scylladb/scylladb: test: audit: use automatic comparators in AuditEntry test: audit: enable syslog audit tests test: audit: sort new audit entries before comparing with expected ones test: audit: check audit logging from multiple nodes test: audit: generate unique uuid for each line in syslog audit test: audit: fix parsing of syslog messages test: audit: synchronize audit syslog server docs: audit: update syslog audit format to the current one audit: bring back commas to audit syslog	2025-07-07 12:45:44 +02:00
Pavel Emelyanov	84e1ac5248	sstables: Move versions static-assertion check to .cc file This check validates that static values of supported versions are "in sync" with each other. It's enough to do it once when compiling sstable_version.cc, not every time the header is included. refs: #1 (not that it helps noticeably, but technically it fits) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24839	2025-07-07 13:16:21 +03:00
Michael Litvak	d7af26a437	test: test_batchlog_manager: batchlog replay includes cdc Add a new test that verifies that when replaying batch mutations from the batchlog, the mutations include cdc augmentation if needed. This is done in order to verify that it works currently as expected and doesn't break in the future.	2025-07-07 12:24:05 +03:00
Michael Litvak	a9b476e057	test: test_batchlog_manager: test batch replay when a node is down Add a test of the batchlog manager replay loop applying failed batches while some replica is down. The test reproduces an issue where the batchlog manager tries to replay a failed batch, doesn't get a response from some replica, and becomes stuck. It verifies that the batchlog manager can eventually recover from this situation and continue applying failed batches.	2025-07-07 12:23:06 +03:00
Michael Litvak	74a3fa9671	batchlog_manager: set timeout on writes Set a timeout on writes of replayed batches by the batchlog manager. We want to avoid having infinite timeout for the writes in case it gets stuck for some unexpected reason. The timeout is set to be high enough to allow any reasonable write to complete.	2025-07-07 12:23:06 +03:00
Michael Litvak	7150632cf2	batchlog_manager: abort writes on shutdown On shutdown of batchlog manager, abort all writes of replayed batches by the batchlog manager. To achieve this we set the appropriate write_type to BATCH, and on shutdown cancel all write handlers with this type.	2025-07-07 12:23:06 +03:00
Michael Litvak	fc5ba4a1ea	batchlog_manager: create cancellable write response handler When replaying a batch mutation from the batchlog manager and sending it to all replicas, create the write response handler as cancellable. To achieve this we define a new wrapper type for batchlog mutations - batchlog_replay_mutation, and this allows us to overload create_write_response_handler for this type. This is similar to how it's done with hint_wrapper and read_repair_mutation.	2025-07-07 12:23:06 +03:00
Michael Litvak	8d48b27062	storage_proxy: add write type parameter to mutate_internal Currently mutate_internal has a boolean parameter `counter_write` that indicates whether the write is of counter type or not. We replace it with a more general parameter that allows to indicate the write type. It is compatible with the previous behavior - for a counter write, the type COUNTER is passed, and otherwise a default value will be used as before.	2025-07-07 12:23:06 +03:00
Nadav Har'El	18b6c4d3c5	alternator: lower maximum table name length to 192 Currently, Alternator allows creating a table with a name up to 222 (max_table_name_length) characters in length. But if you do create a table with such a long name, you can have some difficulties later: You will not be able to add Streams or GSI or LSI to that table, because 222 is also the absolute maximum length Scylla tables can have and the auxiliary tables we want to create (CDC log, materialized views) will go over this absolute limit (max_auxiliary_table_name_length). This is not nice. DynamoDB users assume that after successfully creating a table, they can later - perhaps much later - decide to add Streams or GSI to it, and today if they chose extremely long names, they won't be able to do this. So in this patch, we lower max_table_name_length from 222 to 192. A user will not be able to create tables with longer names, but the good news is that once successfully creating a table, it will always be possible to enable Streams on it (the CDC log table has an extra 15 bytes in its name, and 192 + 15 is less than 222), and it will be possible to add GSIs with short enough names (if the GSI name is 29 or less, 192 + 29 + 1 = 222). This patch is a trivial one-line code change, but also includes the corrected documentation of the limits, and a fix for one test that previously checked that a table name with length 222 was allowed - and now needs to check 192 because 222 is no longer allowed. Note that if a user has existing tables and upgrades Scylla, it is possible that some pre-existing Alternator tables might have lengths over 192 (up to 222). This is fine - in the previous patches we made sure that even in this case, all operations will still work correctly on these old tables (by not validating the name!), and we also made sure that attempting to enable Streams may fail when the name is too long (we do not remove those old checks in this patch, and don't plan to remove them in the foreseeable future). Note that the limit we chose - 192 characters - is identical to the table name limit we recently chose in CQL. It's nicer that we don't need to memorize two different limits for Alternator and CQL. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-07 11:58:21 +03:00
Nadav Har'El	3ed8e269f9	alternator: don't crash when adding Streams to long table name Currently, in Alternator it is possible to create a table whose name has 222 characters, and then trying to add Streams to that table results in an attempt to create a CDC log table with the same name plus a 15-character suffix "_scylla_cdc_log", which resulted (Ref #24598) in an IO-error and a Scylla shutdown. This patch adds code to the Stream-adding operations (both CreateTable and UpdateTable) that validates that the table's name, plus that 15 character suffix, doesn't exceed max_auxiliary_table_name_length, i.e., 222. After this patch, if you have a table whose name is between 207 and 222 characters, attempting to enable Streams on it will fail with: "Streams cannot be added if the table name is longer than 207 characters." Note that in the future, if we lower max_table_name_length to below 207, e.g., to 192, then it will always be possible to add a stream to any legal table, and the new checks we had here will be mostly redundant. But only "mostly" - not entirely: Checking in UpdateTable is still important because of the possibility that an upgrading user might have a pre-existing table whose name is longer than the new limit, and might try to enable Streams. After this patch, the crash reported in #24598 can no longer happen, so in this sense the bug is solved. However, we still want to lower max_table_name_length from 222 to 192, so that it will always be possible to enable streams on any table with a legal name length. We'll do this in the next patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-07 11:58:13 +03:00
Nadav Har'El	898665ca38	alternator: split length limit for regular and auxiliary tables Alternator has a constant, max_table_name_length=222, which is currently used for two different things: 1. Limiting the length of the name allowed for Alternator table. 2. Limiting the length of some auxiliary tables the user is not aware of, such as a materialized view (whose name is tablename:indexname) or (in the next patch) CDC log table. In principle, there is no reason why these two limits need to be identical - we could lower the table name limit to, say, 192, but still allow the tablename:indexname to be even longer, up to 222 - i.e., allow creating materialized views even on tables whose name has 192 characters. So in this patch we split this variable into two, max_table_name_length and max_auxiliary_table_name_length. At the moment, both are still set to the same value - 222. In a following patch we plan to lower max_table_name_length but leave max_auxiliary_table_name_length at 222. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-07 11:43:49 +03:00
Gleb Natapov	4e6369f35b	topology coordinator: log a start and an end of topology coordinator command execution at info level Those calls are relatively rare and the output may help to analyze issues in production.	2025-07-07 10:46:22 +03:00
Gleb Natapov	c8ce9d1c60	topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc The topology coordinator executes several topology cmd rpc against some nodes during a topology change. A topology operation will not proceed unless rpc completes (successfully or not), but sometimes it appears that it hangs and it is hard to tell on which nodes it did not complete yet. Introduce new REST endpoint that can help with debugging such cases. If executed on the topology coordinator it returns currently running topology rpc (if any) and a list of nodes that did not reply yet.	2025-07-07 10:46:03 +03:00
Nadav Har'El	09aa062ab6	alternator: avoid needlessly validating table name In commit `d8c3b144cb` we fixed #12538: That issue noted that most requests which take a TableName don't need to "validate" the table's name (check that it has allowed characters and length) if the table is found in the schema. We only need to do this validation on CreateTable, or when the table is not found (because in that case, DynamoDB chose to print a validation error instead of table-not-found error). It turns out that the fix missed a couple of places where the name validation was unnecessary, so this patch fixes those remaining places. The original motivation for fixing #12538 was performance, so it focused just on cheap common requests. But now, we want to be sure we fixed all requests, because of a new motivation: We are considering, due to #24598, to lower the maximum allowed table name length. However, when we'll do that, we'll want the new lower length limit to not apply to already existing tables. For example, it should be possible to delete a pre-existing table with DeleteTable, if it exists, without the command complaining that the name of this table is too long. So it's important to make sure that the table's name is only validated in CreateTable or if the table does not exist. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-07-07 10:05:43 +03:00
Avi Kivity	d4efefbd9c	Merge 'Improve background disposal of tablet_metadata' from Benny Halevy As seen in #23284, when the tablet_metadata contains many tables, even empty ones, we're seeing a long queue of seastar tasks coming from the individual destruction of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`. This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects on their owner shard by sorting them into vectors, per- owner shard. Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the contained tablet_metadata would be cleared gently. Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom and verify that it is gone with this change. Fixes #24814 Refs #23284 This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables. * Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards Closes scylladb/scylladb#24618 * github.com:scylladb/scylladb: token_metadata_impl: clear_gently: release version tracker early test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables token_metadata: clear_and_destroy_impl when destroyed token_metadata: keep a reference to shared_token_metadata token_metadata: move make_token_metadata_ptr into shared_token_metadata class replica: database: get and expose a mutable locator::shared_token_metadata locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction	2025-07-06 19:43:50 +03:00
Benny Halevy	6e4803a750	token_metadata_impl: clear_gently: release version tracker early No need to wait for all members to be cleared gently. We can release the version earlier since the held version may be awaited for in barriers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 15:07:31 +03:00
Benny Halevy	4a3d14a031	test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables Reproduces #23284 Currently skipped in release mode since it requires the `short_tablet_stats_refresh_interval` interval. Ref #24641 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 15:07:31 +03:00
Benny Halevy	2c0bafb934	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. Since it's a reference-counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 15:07:31 +03:00
Benny Halevy	2b2cfaba6e	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clear and destroy the token_metadata_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 15:07:31 +03:00
Benny Halevy	e0a19b981a	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 14:22:20 +03:00
Benny Halevy	493a2303da	replica: database: get and expose a mutable locator::shared_token_metadata Prepare for the next patch, which will use this shared_token_metadata to make mutable_token_metadata_ptr:s Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 14:22:20 +03:00
Benny Halevy	3acca0aa63	locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction Sort all tablet_map_ptr:s by shard_id and then destroy them on each shard to prevent long cross-shard task queues for foreign_ptr destructions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-06 14:20:46 +03:00
Ernest Zaslavsky	8ac2978239	sstables: coroutinize futurized readers Coroutinize futurized readers and sources to get ready for using `make_data_or_index_source` in `sstable`	2025-07-06 09:18:39 +03:00
Ernest Zaslavsky	0de61f56a2	sstables: add `make_data_or_index_source` to the `storage` Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage`, which just creates a `data_source` from a file, and for `s3_storage`, which creates a (maybe) decrypting source from s3 make_download_source. This change should improve performance when reading large objects from S3 and should not affect anything for the `filesystem_storage`.	2025-07-06 09:18:39 +03:00
Ernest Zaslavsky	7e5e3c5569	encryption: refactor key retrieval Get the encryption schema extension retrieval code out of `wrap_file` method to make it reusable elsewhere	2025-07-06 09:18:39 +03:00
Ernest Zaslavsky	211daeaa40	encryption: add `encrypted_data_source` class Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. NOTE: The wrapped source MUST read from offset 0; `encrypted_data_source` assumes that it does. Co-authored-by: Calle Wilund <calle@scylladb.com>	2025-07-06 09:18:39 +03:00
Pavel Emelyanov	4d6385fc27	api: Remove unused get_json_return_type() templates Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24837	2025-07-05 18:42:02 +03:00
Avi Kivity	33225b730d	Merge 'Do not reference db::config by transport::server' from Pavel Emelyanov The db::config is top-level configuration class that includes options for pretty much everything in Scylla. Instead of messing with this large thing, individual services have their own smaller configs, that are initialized with values from db::config. This PR makes it for transport::server (transport::controller will be next) and its cql_server_config. One bad thing not to step on is that updateable_value is not shard-safe (#7316), but the code in controller that creates cql_server_config is already taking care. Closes scylladb/scylladb#24841 * github.com:scylladb/scylladb: transport: Stop using db::config by transport::server transport: Keep uninitialized_connections_semaphore_cpu_concurrency on cql_server_config transport: Move cql_duplicate_bind_variable_names_refer_to_same_variable to cql_server_config transport: Move max_concurrent_requests to struct config transport: Use cql_server_config::max_request_size	2025-07-05 18:39:01 +03:00
Pavel Emelyanov	9b178df7dd	transport: Stop using db::config by transport::server Now the server is self-contained in the way it is being configured by the controller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-04 15:40:20 +03:00
Pavel Emelyanov	e2c1484d8d	transport: Keep uninitialized_connections_semaphore_cpu_concurrency on cql_server_config This also repeats previous patch for another updateable_value. The thing here is that this config option is passed further to generic_server, but not used by transport::server itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-04 15:40:20 +03:00
Pavel Emelyanov	64ffe67cbd	transport: Move cql_duplicate_bind_variable_names_refer_to_same_variable to cql_server_config Similarly to previous patch -- move yet another updateable_value to let transport::server eventually stop messing with db::config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-04 15:40:14 +03:00
Pavel Emelyanov	b6546ed5ff	transport: Move max_concurrent_requests to struct config This is updateable_value that's initialized from db::config named_value to tackle its shard-unsafety. However, the cql_server_config is created by controller using sharded_parameter() helper, so that it can be safely passed to server. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-04 15:35:55 +03:00
Pavel Emelyanov	6075eca168	transport: Use cql_server_config::max_request_size It's duplicated on config and the transport::server that aggregates the config itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-04 15:34:53 +03:00
Andrei Chekun	d81820f529	test.py: move deleting directory to prepare_dir Instead of explicitly calling directory removal, move it to the prepare_dir method. If the passed pattern is '*', then the directory will be deleted; in other cases, only the files matching the pattern are deleted.	2025-07-04 13:39:42 +02:00
Andrzej Jackowski	55e542e52e	test: audit: use automatic comparators in AuditEntry Replace manual comparator implementations with generated comparators. This simplifies future maintenance and ensures comparators remain accurate when new fields are added. Reorder fields in AuditEntry so the less-than comparator evaluates the most significant fields first.	2025-07-04 13:08:29 +02:00
Andrzej Jackowski	d7711a5f3a	test: audit: enable syslog audit tests Several audit test issues were resolved in numerous commits of this patch series. This commit enables the syslog audit tests, that should finally pass.	2025-07-04 12:40:57 +02:00
Andrzej Jackowski	3ebc693e70	test: audit: sort new audit entries before comparing with expected ones In some corner cases, the order of audit entries can change. For instance, ScyllaDB is allowed to apply BATCH statements in an order different from the order in which they are listed in the statement. To prevent test failures in such cases, this commit sorts new audit entries. Additionally, it is possible that some of the audit entries won't be received by the SYSLOG server immediately. To prevent test failures in this scenario, waiting for the expected number of new audit entries is added.	2025-07-04 12:40:57 +02:00
Andrzej Jackowski	436e86d96a	test: audit: check audit logging from multiple nodes Before this change, the `assert_audit_row_eq` check assumed that audit logs were always generated by the same (first) node. However, this assumption is invalid in a multi-node setup. This commit modifies the check to just verify that one of the nodes in the cluster generated the audit log.	2025-07-04 12:40:57 +02:00
Andrzej Jackowski	2fefa29de7	test: audit: generate unique uuid for each line in syslog audit Audit to TABLE uses a time UUID as a clustering key, while audit to SYSLOG simply appends new lines. As a result, having such a detailed time UUID is unnecessary for SYSLOG. However, TABLE tests expect each line to be unique, and a similar check is performed (and fails) in SYSLOG tests. This commit updates the test framework to generate a unique UUID for each line in SYSLOG audit. This ensures the tests remain consistent for both TABLE and SYSLOG audit.	2025-07-04 12:40:57 +02:00
Andrzej Jackowski	f85e738b11	test: audit: fix parsing of syslog messages Before this commit, there were following issues with parsing of syslog messages in audit tests: - `line_to_row()` function was never called - `line_to_row()` was not prepared for changes introduced in scylladb#23099 (i.e. key=value pairs) - `line_to_row()` didn't handle newlines in queries - `line_to_row()` didn't handle "\\" escaping in queries Due to the aforementioned issues, the syslog audit tests were failing. This commit fixes all of those issues, by parsing each audit syslog message using a regexp.	2025-07-04 12:40:51 +02:00
Pavel Emelyanov	4d4406c5bc	Merge 'test.py: dtest: port next_gating tests from auth_test.py' from Evgeniy Naydanov Copy `auth_test.py` from scylla-dtest test suite, remove all non-next_gating tests from it, and make it work with `test.py`. As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `required_features("!consistent-topology-changes")` marker. Remove `test_permissions_caching` test because it's too flaky when running using test.py. Also, make a few execution-time optimizations: - remove redundant `time.sleep(10)` - use smaller timeouts for CQL sessions Enable the test in `suite.yaml` (run in dev mode only.) Additional modifications to test.py/dtest shim code: - Modify ManagerClient.server_update_config() method to change multiple config options in one call in addition to one `key: value` pair. - Implement the method using slightly modified `set_configuration_options()` method of `ScyllaCluster`. - Copy generate_cluster_topology() function from tools/cluster_topology.py module. - Add support for `bootstrap` parameter for `new_node()` function. - Rework `wait_for_any_log()` function. Closes scylladb/scylladb#24648 * github.com:scylladb/scylladb: test.py: dtest: make auth_test.py run using test.py test.py: dtest: rework wait_for_any_log() test.py: dtest: add support for bootstrap parameter for new_node test.py: dtest: add generate_cluster_topology() function test.py: dtest: add ScyllaNode.set_configuration_options() method test.py: pylib/manager_client: support batch config changes test.py: dtest: copy unmodified auth_test.py test.py: dtest: add missed markers to pytest.ini	2025-07-04 10:51:52 +03:00
Botond Dénes	258bf664ee	scylla-gdb.py: sstable-summary: adjust for raw-tokens `01466be7b9` changed the summary entries, storing raw tokens in them, instead of dht::token. Adjust the command so that it works with both pre- and post- versions. Also make it accept pointers to sstables as arguments, this is what scylla sstables listing provides. Closes scylladb/scylladb#24759	2025-07-04 10:44:25 +03:00
Patryk Jędrzejczak	8d925b5ab4	test: increase the default timeout of graceful shutdown Multiple tests are currently flaky due to graceful shutdown timing out when flushing tables takes more than a minute. We still don't understand why flushing is sometimes so slow, but we suspect it is an issue with new machines spider9 and spider11 that CI runs on. All observed failures happened on these machines, and most of them on spider9. In this commit, we increase the timeout of graceful shutdown as a temporary workaround to improve CI stability. When we get to the bottom of the issue and fix it, we will revert this change. Ref #12028 It's a temporary workaround to improve CI stability, we don't have to backport it. Closes scylladb/scylladb#24802	2025-07-04 10:43:38 +03:00
Avi Kivity	60f407bff4	storage_proxy: avoid large allocation when storing batch in system.batchlog Currently, when computing the mutation to be stored in system.batchlog, we go through data_value. In turn this goes through `bytes` type (#24810), so it causes a large contiguous allocation if the batch is large. Fix by going through the more primitive, but less contiguous, atomic_cell API. Fixes #24809. Closes scylladb/scylladb#24811	2025-07-04 10:43:05 +03:00
Avi Kivity	5cbeae7178	sstables: drop minimum_key(), maximum_key() Not used. Closes scylladb/scylladb#24825	2025-07-04 10:42:44 +03:00
Dawid Mędrek	a151944fa6	treewide: Replace __builtin_expect with (un)likely C++20 introduced two new attributes--likely and unlikely--that function as a built-in replacement for __builtin_expect implemented in various compilers. Since it makes code easier to read and it's an integral part of the language, there's no reason to not use it instead. Closes scylladb/scylladb#24786	2025-07-03 13:34:04 +03:00
dependabot[bot]	59cc496757	build(deps): bump sphinx-scylladb-theme from 1.8.6 to 1.8.7 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.6 to 1.8.7. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.6...1.8.7) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.7 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#24805	2025-07-03 12:04:24 +03:00
Gleb Natapov	ca7837550d	topology coordinator: do not set request_type field for truncation command if topology_global_request_queue feature is not enabled yet Old nodes do not expect global topology request names to be in request_type field, so set it only if a cluster is fully upgraded already. Closes scylladb/scylladb#24731	2025-07-02 17:09:29 +02:00
Pavel Emelyanov	fa0077fb77	Merge 'S3 chunked download source bug fixes' from Ernest Zaslavsky - Fix missing negation in the `if` in the background downloading fiber - Add test to catch this case - Improve the s3 proxy to inject errors if the same resource is requested more than once - Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appearing more than once in the buffer deque - Inject exception from the test to simulate response callback failure in the middle No need to backport anything since this class is not used yet Closes scylladb/scylladb#24657 * github.com:scylladb/scylladb: s3_test: Add s3_client test for non-retryable error handling s3_test: Add trace logging for default_retry_strategy s3_client: Fix edge case when the range is exhausted s3_client: Fix indentation in try..catch block s3_client: Stop retries in chunked download source s3_client: Enhance test coverage for retry logic s3_client: Add test for Content-Range fix s3_client: Fix missing negation s3_client: Refine logging s3_client: Improve logging placement for current_range output	2025-07-02 14:45:10 +03:00
Patryk Jędrzejczak	fa982f5579	docs: handling-node-failures: fix typo Replacing "from" is incorrect. The typo comes from recently merged #24583. Fixes #24732 Requires backport to 2025.2 since #24583 has been backported to 2025.2. Closes scylladb/scylladb#24733	2025-07-02 12:22:01 +03:00
Konstantin Osipov	37fc4edeb5	test.py: add a way to provide pytest arguments via test.py Now that we use a single pytest.ini for all tests, different developer preferences collide. There should be an easy way to override pytest.ini defaults from the command line. Fixes https://github.com/scylladb/scylladb/issues/21800 Closes scylladb/scylladb#24573	2025-07-02 12:20:43 +03:00
Nikos Dragazis	fbc9ead182	doc: Expose new `aws_session_token` option for KMS hosts Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-02 12:04:40 +03:00
Nikos Dragazis	4c66769e07	kms_host: Support authn with temporary security credentials There are two types of AWS security credentials: * long-term credentials (access key id + secret access key) * temporary credentials (access key id + secret access key + session token) The KMS host can obtain these credentials from multiple sources: * IMDS (config option `aws_use_ec2_credentials`) * STS, by assuming an IAM role (config option `aws_assume_role_arn`) * Scylla config (options `aws_access_key_id`, `aws_secret_access_key`) * Env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) * AWS credentials file (~/.aws/credentials) First two sources return temporary credentials. The rest return long-term credentials. Extend the KMS host to support temporary credentials from the other three sources as well. Introduce the config option `aws_session_token`, and parse the same-named env var and config option from the credentials file. Also, support `aws_security_token` as an alias, for backwards compatibility. This patch facilitates local debugging of corrupted SSTables, as well as testing, using temporary credentials obtained from STS through other authentication means (e.g., Okta + gimme-aws-creds). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-02 12:04:40 +03:00
Nikos Dragazis	37894c243d	encryption_config: Mention environment in credential sources for KMS The help string for the `--kms-hosts` command-line option mentions only the AWS credentials file as a fall-back search path, in case no explicit credentials are given. Extend the help string to mention the environment as well. Make it clear that the environment has higher precedence than the credentials file. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-07-02 12:04:40 +03:00
Avi Kivity	dfaed80f55	Merge 'types: add byte-comparable format support for native cql3 types' from Lakshmi Narayanan Sreethar This PR introduces a new `comparable_bytes` class to add byte-comparable format support for all the [native cql3 data types](https://opensource.docs.scylladb.com/stable/cql/types.html#native-types) except `counter` type as that is not comparable. The byte-comparable format is a pre-requisite for implementing the trie based index format for our sstables (https://github.com/scylladb/scylladb/issues/19191). This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md Note that support for composite data types like lists, maps, and sets has not been implemented yet and will be made available in a separate PR. Refs https://github.com/scylladb/scylladb/issues/19407 New feature - backport not required. Closes scylladb/scylladb#23541 * github.com:scylladb/scylladb: types/comparable_bytes: add testcase to verify compatibility with cassandra types/comparable_bytes: support variable-length natively byte-ordered data types types/comparable_bytes: support decimal cql3 types types/comparable_bytes: introduce count_digits() method types/comparable_bytes: support uuid and timeuuid cql3 types types/comparable_bytes: support varint cql3 type types/comparable_bytes: support skipping sign byte write in decode_signed_long_type types/comparable_bytes: introduce encode/decode_varint_length types/comparable_bytes: support float and double cql3 types types/comparable_bytes: support date, time and timestamp cql3 types types/comparable_bytes: support bigint cql3 type types/comparable_bytes: support fixed length signed integers types/comparable_bytes: support boolean cql3 type types: introduce comparable_bytes class bytes_ostream: overload write() to support writing from FragmentedView docs: fix minor typo in docs/dev/cql3-type-mapping.md	2025-07-02 11:58:32 +03:00
Avi Kivity	1e0b015c8b	Merge 'cql3: Represent create_statement using managed_bytes' from Dawid Mędrek When describing a table, we need to do it carefully: if some columns were dropped, we must specify that explicitly by ``` ALTER TABLE {table} DROP {column} USING TIMESTAMP ... ``` in the result of the DESCRIBE statement. Failing to do so could lead to data resurrection. However, if a table has been altered many, many times, we might end up with a huge create statement. Constructing it could, in turn, trigger an oversized allocation. Some tests ran into that very problem in fact. In this commit, we want to mitigate the problem: instead of allocating a contiguous chunk of memory for the create statement, we use `bytes_ostream` and `managed_bytes` to possibly keep data scattered in memory. It makes handling `cql3::description` less convenient in the code, but since the struct is pretty much immediately serialized after creating it, it's a very good trade-off. A reproducer is intentionally not provided by this commit: it's easy to test the change, but adding and dropping a huge number of columns would take a really long amount of time, so we need to omit it. Fixes scylladb/scylladb#24018 Backport: all of the supported versions are affected, so we want to backport the changes there. Closes scylladb/scylladb#24151 * github.com:scylladb/scylladb: cql3/description: Serialize only rvalues of description cql3: Represent create_statement using managed_string cql3/statements/describe_statement.cc: Don't copy descriptions cql3: Use managed_bytes instead of bytes in DESCRIBE utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream	2025-07-01 21:59:38 +03:00
Lakshmi Narayanan Sreethar	5f5a8cf54c	types/comparable_bytes: add testcase to verify compatibility with cassandra	2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar	6c1853a830	types/comparable_bytes: support variable-length natively byte-ordered data types The following cql3 data types - ascii, blob, duration, inet, and text - are natively byte-ordered in their serialized forms. To encode them into a byte-comparable format, zeros are escaped, and since these types have variable lengths, the encoded form is terminated in an escaped state to mark its end. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar	5c77d17834	types/comparable_bytes: support decimal cql3 types The decimal cql3 type is internally stored as a scale and an unscaled integer. To convert them into a byte comparable format, they are first normalized into a base-100 exponent and a mantissa that lies in [0.01, 1) and then encoded into a byte sequence that preserves the numerical order. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar	832236d044	types/comparable_bytes: introduce count_digits() method Implemented a method `count_digits()` to return the number of significant digits in a given boost::multiprecision:cpp_int. This is required to convert big_decimal to a byte comparable format. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar	a00c5d3899	types/comparable_bytes: support uuid and timeuuid cql3 types The uuid type values are composed of two fixed-length unsigned integers: an msb and an lsb. The msb contains a version digit, which must be pulled first in a byte-comparable representation. For version 1 uuids, in addition to extracting the version digit first, the msb must be rearranged to make it byte comparable. The lsb is written as is. For the timeuuid type, the msb is handled similar to the version 1 uuid values. The lsb however is treated differently - the sign bits of all bytes are inverted to preserve the legacy comparison order, which compared individual bytes as signed values. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:08 +05:30
Lakshmi Narayanan Sreethar	4592b9764c	types/comparable_bytes: support varint cql3 type Any varint value less than 7 bytes is encoded using the signed long encoding format and remaining values are all encoded using the full form encoding : <signbyte><length as unsigned integer - 7><7 or more bytes>, where <signbyte> is 00 for negative numbers and FF for positive ones, and the length's bytes are inverted if the number is negative (so that longer length sorts smaller). Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	1b6b0a665d	types/comparable_bytes: support skipping sign byte write in decode_signed_long_type The decode_signed_long_type() method writes leading sign bytes when decoding a byte-comparable encoded signed long value. The varint decoder depends on this method to decode values up to a certain length and expects the decoded form to include sign-only bytes only when necessary. Update the decode_signed_long_type() code to allow skipping the write of sign-only bytes based on the caller's request. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	ad45a19373	types/comparable_bytes: introduce encode/decode_varint_length The length of a varint value is encoded separately as an unsigned variable-length integer. For negative varint values, the encoded bytes are flipped to ensure that longer lengths sort smaller. This patch implements both encoding and decoding logic for varint lengths and will be used by the subsequent patch. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	7af153c237	types/comparable_bytes: support float and double cql3 types The sign bit is flipped for positive values to ensure that they are ordered after negative values. For negative values, all the bytes are inverted, allowing larger negative values to be ordered before smaller ones. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	0145c1d705	types/comparable_bytes: support date, time and timestamp cql3 types Both the date and time cql3 types are internally unsigned fixed length integers. Their serialized form is already byte comparable, so the encoder and decoder return the serialized bytes as is. The timestamp type is encoded using the fixed length signed integer encoding. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	b6ff3f5304	types/comparable_bytes: support bigint cql3 type The bigint type, internally implemented as a long data type, is encoded using a variable-length encoding similar to UTF-8. This enables a significant amount of space to be saved when smaller numbers are frequently used, while still permitting large values to be efficiently encoded. The first bit of the encoding represents the inverted sign (i.e., 1 for positive, 0 for negative), followed by length encoded as a sequence of bits matching the inverted sign. This is then followed by a differing bit (except for 9-byte encodings) and the bits of the number's two's complement. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	c0d25060bd	types/comparable_bytes: support fixed length signed integers To encode fixed-length signed integers in a byte-comparable format, the first bit of each value is inverted. This ensures that negative numbers are ordered before positive ones during comparison. This patch adds support for the data types : byte_type (tinyint), short_type (smallint), and int32_type (int). Although long_type (bigint) is a fixed length integer type, it has different byte comparable encoding and will be handled separately in another patch. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	8572afca2b	types/comparable_bytes: support boolean cql3 type Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	74c556a33d	types: introduce comparable_bytes class This patch implements a new class, `comparable_bytes`, designed to implement methods for converting data values to and from byte-comparable formats. The class stores the comparable bytes as `managed_bytes` and currently provides the structure for all required methods. The actual logic for converting various data types will be implemented in subsequent patches. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	e4c7cb7834	bytes_ostream: overload write() to support writing from FragmentedView Overloaded write() method to support writing a FragmentedView into bytes_ostream. Also added a testcase to verify the implementation. The new helper will be used by the byte_comparable implementation during the encode/decode process. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Lakshmi Narayanan Sreethar	068e74b457	docs: fix minor typo in docs/dev/cql3-type-mapping.md Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-07-01 22:19:07 +05:30
Ernest Zaslavsky	acf15eba8e	s3_test: Add s3_client test for non-retryable error handling Introduce a test that injects a non-retryable error and verifies that the chunked download source throws an exception as expected.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	a5246bbe53	s3_test: Add trace logging for default_retry_strategy Introduce trace-level logging for `default_retry_strategy` in `s3_test` to improve visibility into retry logic during test execution.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	49e8c14a86	s3_client: Fix edge case when the range is exhausted Handle case where the download loop exits after consuming all data, but before receiving an empty buffer signaling EOF. Without this, the next request is sent with a non-zero offset and zero length, resulting in "Range request cannot be satisfied" errors. Now, an empty buffer is pushed to indicate completion and exit the fiber properly.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	e50f247bf1	s3_client: Fix indentation in try..catch block Correct indentation in the `try..catch` block to improve code readability and maintain consistent formatting.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	d2d69cbc8c	s3_client: Stop retries in chunked download source Disable retries for S3 requests in the chunked download source to prevent duplicate chunks from corrupting the buffer queue. The response handler now throws an exception to bypass the retry strategy, allowing the next range to be attempted cleanly. This exception is only triggered for retryable errors; unretryable ones immediately halt further requests.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	c75acd274c	s3_client: Enhance test coverage for retry logic Extend the S3 proxy to support error injection when the client makes multiple requests to the same resource—useful for testing retry behavior and failure handling.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	ec59fcd5e4	s3_client: Add test for Content-Range fix Introduce a test that accurately verifies the Content-Range behavior, ensuring the previous fix is properly validated.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	6d9cec558a	s3_client: Fix missing negation Restore a missing `not` in a conditional check that caused incorrect behavior during S3 client execution.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	e73b83e039	s3_client: Refine logging Fix typo in log message to improve clarity and accuracy during S3 operations.	2025-07-01 18:45:17 +03:00
Ernest Zaslavsky	f1d0690194	s3_client: Improve logging placement for current_range output Relocated logging to occur after determining the `current_range`, ensuring more relevant output during S3 client operations.	2025-07-01 18:45:17 +03:00
Tomasz Grabiec	97679002ee	Merge 'Co-locate tablets of different tables' from Michael Litvak Add the option to co-locate tablets of different tables. For example, a base table and its CDC table, or a local index. main changes and ideas: * "table group" - a set of one or more tables that should be co-located. (Example: base table and CDC table). A group consists of one base table and zero or more children tables. * new column `base_table` in `system.tablets`: when creating a new table, it can be set to point to a base table, which the new table's tablets will be co-located with. when it's set, the tablet map information should be retrieved from the base table map. the child map doesn't contain per-tablet information. * co-located tables always have the same tablet count and the same tablet replicas. each tablet operation - migration, resize, repair - is applied on all tablets in a synchronized manner by the topology coordinator. * resize decision for a group is made by combining the per-table hints and comparing the average tablet size (over all tablets in the group) with the target tablet size. * the tablets load balancer works with the base table as a representative of the group. it represents a single migration unit with some `group_size` that is taken into account. * view tablets are co-located with base tablets when the partition keys match. Fixes https://github.com/scylladb/scylladb/issues/17043 backport is not needed. this is preliminary work for support of MVs and CDC with tablets. 
Closes scylladb/scylladb#22906 * github.com:scylladb/scylladb: tablets: validate no clustering row mutations on co-located tables raft_group0_client: extend validate_change to mixed_change type docs: topology-over-raft: document co-located tables tablet-mon.py: visual indication for co-located tablets tablet-mon.py: handle co-located tablets test/boost/view_schema_test.cc: fix race in wait_until_built boost/tablets_test: test load balancing and resize of co-located tablets test/tablets: test tablets colocation tablets: co-locate view tablets with base when the partition keys match test/pylib/tablets: common get_tablet_count api test_mv_tablets: use get_tablet_replicas from common tablets api test/pylib/tablets: fix test api to read tablet replicas from base table tablets: allocator: create co-located tables in a single operation alternator: prepare all new tables in a single announcement migration_manager: add notification for creating multiple tables tablets: read_tablet_transition_stage: read from base table storage service: allow repair request only on base tables tablets: keyspace_rf_change: apply on base table storage service: generate tablet migration updates on base tables tablets: replace all_tables method tablets: split when all co-located tablets are ready tablets: load balancer: sizing plan for table groups tablets: load balancer: handle co-located tablets tablets: allocate co-located tablets tablets: handle migration of co-located tablets storage service: add repair colocated tablets rpc tablets: save and read tablet metadata of co-located tables tablets: represent co-located tables in tablet metadata tablets: add base_table column to system.tablets docs: update system.tablets schema	2025-07-01 16:02:30 +02:00
Tomasz Grabiec	6290b70d53	Merge 'repair: postpone repair until topology is not busy ' from Aleksandra Martyniuk Currently, repair_service::repair_tablets starts repair if there are no ongoing tablet operations. The check does not consider global topology operations, like tablet resize finalization. Hence, if: - topology is in the tablet_resize_finalization state; - repair starts (as there are no tablet transitions) and holds the erm; - resize finalization finishes; then the repair sees a topology state different from the actual one - it does not see that the storage groups were already split. Repair code does not handle this case and it results in on_internal_error. Start repair when topology is not busy. The check isn't atomic, as it's done on shard 0. Thus, we compare the topology versions to ensure that the busyness check is valid. Fixes: https://github.com/scylladb/scylladb/issues/24195. Needs backport to all branches since they are affected Closes scylladb/scylladb#24202 * github.com:scylladb/scylladb: test: add test for repair and resize finalization repair: postpone repair until topology is not busy	2025-07-01 16:02:22 +02:00
Botond Dénes	37ef9efb4e	docs: cql/types.rst: remove reference to frozen-only UDTs ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing this limitation in the current docs. Replace the description of the limitation with general description of frozen semantics for UDTs. Fixes: #22929 Closes scylladb/scylladb#24763	2025-07-01 16:19:18 +03:00
Michał Chojnowski	a29724479a	utils/alien_worker: fix a data race in submit() We move a `seastar::promise` on the external worker thread, after the matching `seastar::future` was returned to the shard. That's illegal. If the `promise` move occurs concurrently with some operation (move, await) on the `future`, it becomes a data race which could cause various kinds of corruption. This patch fixes that by keeping the promise at a stable address on the shard (inside a coroutine frame) and only passing through the worker. Fixes #24751 Closes scylladb/scylladb#24752	2025-07-01 15:13:04 +03:00
Łukasz Paszkowski	a22d1034af	test.py: Fix test_compactionhistory_rows_merged_time_window_compaction_strategy The test has three major problems 1. Wrongly computed time windows. Data was not spread across two 1-minute windows, causing the test to generate as many as three sstables instead of two 2. Timestamp was not propagated to the prepared CQL statements. So in fact, the current time was used implicitly 3. Because of the incorrect timestamp issue, the remaining tests testing purged tombstones were affected as well. Fixes https://github.com/scylladb/scylladb/issues/24532 Closes scylladb/scylladb#24609	2025-07-01 15:01:21 +03:00
Avi Kivity	609cc20d22	build: update toolchain to Fedora 42 with clang 20.1 and libstdc++ 15 Rebase to Fedora 42 with clang 20.1 and libstdc++ 15. JAVA8_HOME environment variable dropped since we no longer use it. cassandra-stress package updated with version that doesn't depend on no-longer-available Java 11. Optimized clang binaries generated and stored in https://devpkg.scylladb.com/clang/clang-20.1.7-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.7-Fedora-42-x86_64.tar.gz Closes scylladb/scylladb#23978	2025-07-01 14:39:47 +03:00
Dawid Mędrek	9d03dcd28e	cql3/description: Serialize only rvalues of description We discard instances of `cql3::description` right after serializing them, so let's change the signature of the function to save some work.	2025-07-01 12:58:11 +02:00
Dawid Mędrek	ac9062644f	cql3: Represent create_statement using managed_string When describing a table, we need to do it carefully: if some columns were dropped, we must specify that explicitly by ``` ALTER TABLE {table} DROP {column} USING TIMESTAMP ... ``` in the result of the DESCRIBE statement. Failing to do so could lead to data resurrection. However, if a table has been altered many, many times, we might end up with a huge create statement. Constructing it could, in turn, trigger an oversized allocation. Some tests ran into that very problem in fact. In this commit, we want to mitigate the problem: instead of allocating a contiguous chunk of memory for the create statement, we use `fragmented_ostringstream` and `managed_string` to possibly keep data scattered in memory. It makes handling `cql3::description` less convenient in the code, but since the struct is pretty much immediately serialized after creating it, it's a very good trade-off. We provide a reproducer. It consistently passes with this commit, while having about 50% chance of failure before it (based on my own experiments). Playing with the parameters of the test doesn't seem to improve that chance, so let's keep it as-is. Fixes scylladb/scylladb#24018	2025-07-01 12:58:02 +02:00
Michael Litvak	a5feb80797	tablets: validate no clustering row mutations on co-located tables When preparing a tablet metadata change, add another validation that no clustering row mutations are written to the tablet map of a co-located dependent table. A co-located table should never have clustering rows in the `system.tablets` table. It has only the static row with base_table column set, pointing to the base table.	2025-07-01 13:20:20 +03:00
Michael Litvak	6619e798e7	raft_group0_client: extend validate_change to mixed_change type The function validate_change in raft_group0_client is used currently to validate tablet metadata changes, and therefore it applies only to commands of type topology_change. But the type mixed_change also allows topology change mutations and it's in fact used for tablet metadata changes, for example in keyspace_rf_change. Therefore, extend validate_change to validate also changes of type mixed_change, so we can catch issues there as well.	2025-07-01 13:20:19 +03:00
Michael Litvak	6fa5d2f7c8	docs: topology-over-raft: document co-located tables	2025-07-01 13:20:19 +03:00
Michael Litvak	cb9e03bd09	tablet-mon.py: visual indication for co-located tablets Add a visual indication for groups of co-located tablets in tablet-mon.py. We order the tablets by groups, and draw a rectangle that connects tablets that are co-located	2025-07-01 13:20:19 +03:00
Michael Litvak	b35b7c4970	tablet-mon.py: handle co-located tablets For co-located tablets we need to read the tablet information from the tablet map referenced by base_table. Fix tablet-mon.py to handle co-located tablets by checking if base_table is set when reading the tablets of a table, and if so refer to the base table map.	2025-07-01 13:20:19 +03:00
Michael Litvak	fb18b95b3c	test/boost/view_schema_test.cc: fix race in wait_until_built create the view waiter before creating the view, otherwise if the waiter is created after the view is built we may lose the notification.	2025-07-01 13:20:19 +03:00
Michael Litvak	3b4af89615	boost/tablets_test: test load balancing and resize of co-located tablets Add unit tests of load balancing and resize with co-located tablets.	2025-07-01 13:20:19 +03:00
Michael Litvak	65ed0548d6	test/tablets: test tablets colocation Add tests with co-located tablets, testing migration and other relevant operations.	2025-07-01 13:20:19 +03:00
Michael Litvak	7211c0b490	tablets: co-locate view tablets with base when the partition keys match For a view table that has the same partition key as its base table, the view's tablets are co-located with the base tablets Fixes scylladb/scylladb#17043	2025-07-01 13:20:19 +03:00
Michael Litvak	e01aae7871	test/pylib/tablets: common get_tablet_count api Introduce a common get_tablet_count test api instead of it being duplicated in few tests, and fix it to read the tablet count from the base table.	2025-07-01 13:20:19 +03:00
Michael Litvak	e719da3739	test_mv_tablets: use get_tablet_replicas from common tablets api Replace the duplicated get_tablet_replicas method in test_mv_tablets with the common method from the tablets api, to reduce code duplication and use the correct method that reads the tablet replicas from the base table.	2025-07-01 13:20:19 +03:00
Michael Litvak	6bfb82844f	test/pylib/tablets: fix test api to read tablet replicas from base table When reading tablet replicas from system.tablets, we need to refer to the base table partition, if any. We fix and simplify the test api for reading tablet replicas to read from the base table.	2025-07-01 13:20:19 +03:00
Michael Litvak	018b61f658	tablets: allocator: create co-located tables in a single operation Co-located base and child tables may be created together in a single operation. The tablet allocator in this case needs to handle them together and not each table independently, because we need to have the base schema and tablet map when creating the child tablet map. We do this by registering the tablet allocator to the migration notification on_before_create_column_families that announces multiple new tables, and there we allocate tablets for all the new base tables, and for the new child tables we create their maps from the base tables, which are either a new table or an existing one.	2025-07-01 13:20:19 +03:00
Michael Litvak	2d0ec1c20a	alternator: prepare all new tables in a single announcement When creating base and view tables in alternator, they are created in a single operation, so use a single announcement for creating multiple tables in a single operation instead of announcing each table separately. This is needed because when we create base tables and local indexes we need to make them co-located, so we need to allocate tablets for them together.	2025-07-01 13:20:18 +03:00
Michael Litvak	05ffcefd50	migration_manager: add notification for creating multiple tables Add prepare_new_column_families_announcement for preparing multiple new tables that are created in a single operation. A listener can receive a notification when multiple tables are created. This is useful if the listener needs to have all the new tables, and not work on each new table independently. For example, if there are dependencies between the new tables.	2025-07-01 13:20:18 +03:00
Michael Litvak	064ac25ff9	tablets: read_tablet_transition_stage: read from base table When reading tablet information from system.tablets we need to read it from the base table, if exists.	2025-07-01 13:20:18 +03:00
Michael Litvak	ff9a3c9528	storage service: allow repair request only on base tables Currently, tablet repair runs only on base tables, and not on derived co-located tables. If repair is requested for a non base table throw an error since the operation won't have the intended results.	2025-07-01 13:20:18 +03:00
Michael Litvak	aa990a09c1	tablets: keyspace_rf_change: apply on base table Generate keyspace_rf_change transitions only on base tables, because in a group of co-located tablets their tablet map is shared with the base table.	2025-07-01 13:20:18 +03:00
Michael Litvak	602fa84907	storage service: generate tablet migration updates on base tables When writing transition updates to a tablet map we must do so on a base table. A table that is co-located with a base table doesn't have its own tablet map in the tablets table; it only points to the base table map. By writing to the base table, the tablet migration will be applied for the entire co-location group. We add a small helper in storage_service that creates a tablet mutation builder for the base table, and use it whenever we need to write tablet mutations.	2025-07-01 13:20:18 +03:00
Michael Litvak	ddf02c9489	tablets: replace all_tables method The method all_tables in tablet_metadata is used for iterating over all tables in the tablet metadata with their tablet maps. Now that we have co-located tables we need to distinguish which tables we want to iterate over. In some cases we want to iterate over each group of co-located tables, treating them as one unit, and in other cases we want to iterate over all tables, regardless of whether they are part of a co-located group and have a base table. We replace all_tables with new methods that can be used for each of the cases.	2025-07-01 13:20:18 +03:00
Michael Litvak	255ca569e3	tablets: split when all co-located tablets are ready For a group of co-located tablets, they must be split together atomically, so finalize tablet split only when all tablets in the group are ready.	2025-07-01 13:20:18 +03:00
Michael Litvak	0dcb9f2ed6	tablets: load balancer: sizing plan for table groups We update the sizing plan to work with table groups instead of single tables, using the base table as a representative of a table group. The resize decision is made based on the combined per-table tablet hints, and considering the size of all tables in the group. We calculate the average tablet size of all tablets in the group and compare it with the target tablet size. The target tablet size is changed to be some function of the group size, because we may want to have a lower target tablet size when we have multiple co-located tablets, in order to reduce the migration size.	2025-07-01 13:20:18 +03:00
Michael Litvak	ac5f4da905	tablets: load balancer: handle co-located tablets Tablets of co-located tables are always co-located and migrated together, so they are considered as an atomic unit for the tablets load balancer. We change the load balancer to work with table groups as migration candidates instead of single tables, using the base table of a group as a representative of the group. For the purpose of load calculations, a group of co-located tablets is considered like a single tablet, because their combined target tablet size is the same as that of a single tablet.	2025-07-01 13:20:18 +03:00
Michael Litvak	3db8f6fd37	tablets: allocate co-located tablets When allocating tablets for a new table, add the option to create a co-located tablet map with an existing base table. The co-located tablet map is created with the base_table value set.	2025-07-01 13:20:18 +03:00
Michael Litvak	6bed9d3cfe	tablets: handle migration of co-located tablets When handling tablet transition for a group of co-located tables, maintain co-location by applying each transition operation (streaming, cleanup, repair) on all tablets in the group in a synchronized way. handle_tablet_migration is changed to work on groups of co-located tablets instead of single tablets. Each transition step is handled by applying its operation on all the tablets in the group. The tablet map of co-located tablets is shared, so we need to read and write only the tablet map of the base table.	2025-07-01 13:20:18 +03:00
Michael Litvak	11f045bb7c	storage service: add repair colocated tablets rpc add a new RPC repair_colocated_tablets which is similar to the RPC tablet_repair, but instead of repairing a single tablet it takes a set of co-located tablets, repairs them and returns a shared repair_time result. This is useful because the way co-located tablets are represented doesn't allow to repair tablets independently but only as a group operation, and the repair_time which is stored in the tablet map is shared with the entire co-location group. But when repairing a group of co-located tablets we may require a different behavior, especially considering that co-located tablets are derived tablets of a special type. For example, we may want to skip running repair on CDC tablets when repairing the base table. The new RPC and the storage service function repair_colocated_tablets allow the flexibility to implement different strategies when repairing co-located groups. Currently the implementation is simply to repair each tablet and return the minimum repair_time as the shared repair time.	2025-07-01 13:20:18 +03:00
Yaron Kaikov	fd0e044118	Update ScyllaDB version to: 2025.4.0-dev	2025-07-01 11:33:20 +03:00
Jenkins Promoter	94d7c22880	Update pgo profiles - aarch64	2025-07-01 11:33:20 +03:00
Jenkins Promoter	7531fc72a6	Update pgo profiles - x86_64	2025-07-01 11:33:20 +03:00
Nadav Har'El	e12ff4d3ab	Merge 'LWT: use tablet_metadata_guard' from Petr Gusev This PR is a step towards enabling LWT for tablet-based tables. It pursues several goals: * Make it explicit that the tablet can't migrate after the `cas_shard` check in `select_statement/modification_statement`. Currently, `storage_proxy::cas` expects that the client calls it on the correct shard -- the one which owns the partition key the LWT is running on. The reasons for that are explained in [this commit](`f16e3b0491 (diff-1073ea9ce4c5e00bb6eb614154f523ba7962403a4fe6c8cd877d1c8b73b3f649)`) message. The statements check the current shard and invoke `bounce_to_shard` if it's not the right one. However, the erm strong pointer is only captured in `storage_proxy::cas` and until that moment there is no explicit structure in the code which would prevent the ongoing migrations. In this PR we introduce such a structure -- `erm_handle`. We create it before the `cas_check` and pass it down to `storage_proxy::cas` and `paxos_response_handler`. * Another goal of this PR is an optimization -- we don't want to hold erm for the duration of the entire LWT, unless it directly affects the current tablet. There is a `tablet_metadata_guard` class which is used for long running tablet operations. It automatically switches to a new erm if the topology change represented by the new erm doesn't affect the current tablet. We use this class in `erm_handle` if the table uses tablets. Otherwise, `erm_handle` just stores erm directly. * Fixes [shard bouncing issue in alternator](https://github.com/scylladb/scylladb/issues/17399) Backport: not needed (new feature). 
Closes scylladb/scylladb#24495 * github.com:scylladb/scylladb: LWT: make cas_shard non-optional in sp::cas LWT: create cas_shard in select_statement LWT: create cas_shard in modification and batch statements LWT: create cas_shard in alternator LWT: use cas_shard in storage_proxy::cas do_query_with_paxos: remove redundant cas_shard check storage_proxy: add cas_shard class sp::cas_shard: rename to get_cas_shard token_metadata_guard: a topology guard for a token tablet_metadata_guard: mark as noncopyable and nonmoveable	2025-07-01 11:33:20 +03:00
Gleb Natapov	a221b2bfde	gossiper: do not assume that id->ip mapping is available in failure_detector_loop_for_node failure_detector_loop_for_node may be started on a shard before id->ip mapping is available there. Currently the code treats missing mapping as an internal error, but it uses its result for debug output only, so let's relax the code to not assume the mapping is available. Fixes #23407 Closes scylladb/scylladb#24614	2025-07-01 11:33:20 +03:00
Pavel Emelyanov	26c7f7d98b	Merge 'encryption_at_rest_test: Fix some spurious errors' from Calle Wilund Fixes #24574 * Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert * Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free. Closes scylladb/scylladb#24633 * github.com:scylladb/scylladb: encryption_at_rest_test: Add exception handler to ensure proxy stop encryption: Ensure stopping timers in provider cache objects	2025-07-01 11:33:20 +03:00
Avi Kivity	6aa71205d8	repair: row_level: unstall to_repair_rows_on_wire() destroying its input to_repair_rows_on_wire() moves the contents of its input std::list and is careful to yield after each element, but the final destruction of the input list still deals with all of the list elements without yielding. This is expensive as not all contents of repair_row are moved (_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>). To fix, destroy each row element as we move along. This is safe as we own the input and don't reference row_list other than for the iteration. Fixes #24725. Closes scylladb/scylladb#24726	2025-07-01 11:33:19 +03:00
Pavel Emelyanov	6826856cf8	Merge 'test.py: Fix start 3rd party services' from Andrei Chekun Move the startup of 3rd party services under the `try` clause to avoid the situation where the main process collapses without stopping the services. Without this, if something goes wrong during startup, the exit artifacts are not executed, so the process stays around forever. This functionality is in 2025.2 and can potentially affect jobs, so a backport is needed. Closes scylladb/scylladb#24734 * github.com:scylladb/scylladb: test.py: use unique hostname for Minio test.py: Catch possible exceptions during 3rd party services start	2025-07-01 11:33:19 +03:00
Anna Stuchlik	9234e5a4b0	doc: add the SBOM page and the download link This commit migrates the Software Bill Of Materials (SBOM) page added to the Enterprise docs with https://github.com/scylladb/scylla-enterprise/pull/5067. The only difference is the link to the SBOM files - it was Enterprise SBOM in the Enterprise docs, while here it is a link to the ScyllaDB SBOM. It's a follow-up of the migration to Source Available and should be backported to all Source Available versions - 2025.1 and later. Fixes https://github.com/scylladb/scylladb/issues/24730 Closes scylladb/scylladb#24735	2025-07-01 11:33:19 +03:00
Michael Litvak	c74cbca7cb	tablets: save and read tablet metadata of co-located tables Update the tablet metadata save and read methods to work with tablet maps of co-located tables. The new function colocated_tablet_map_to_mutation is used to generate a mutation of a co-located table to system.tablets. It creates a static row with the base_table column set with the base table id. The function save_tablet_metadata is updated to use this function for co-located tables. When reading tablet metadata from the table, we handle the new case of reading a co-located table. We store the co-located tables relationships in the tablet_metadata_builder's `colocated_tables` map, and process it in on_end_of_stream. The reason we defer the processing is that we want to set all normal tablet maps first, to ensure the base tablet map is found when we process a co-located table.	2025-07-01 10:29:59 +03:00
Michael Litvak	ddfe5dfb6b	tablets: represent co-located tables in tablet metadata Modify tablet_metadata to be able to represent co-located tables. The new method set_colocated_table adds to tablet_metadata a table which is co-located with another table. A co-located table shares the tablet map object with the base table, so we just create a copy of the shared tablet map pointer and store it as the co-located table's tablet map. Whenever a tablet map is modified we update the pointer for all the co-located tables accordingly, so the tablet map remains shared. We add some data structures to tablet_metadata to be able to work with co-located table groups efficiently: * `_table_groups` maps every base table to all tables in its co-location group. This is convenient for iterating over all table groups, or finding all tables in some group. * `_base_table` maps a co-located table to its base table.	2025-07-01 10:29:59 +03:00
Michael Litvak	4777444024	tablets: add base_table column to system.tablets Add a new column base_table to the system.tablets table. It can be set to point to another table to indicate that the tablets of this table are co-located with the tablets of the base table. When it's set, we don't store other tablet information in system.tablets and in the in-memory tablet map object for this table, and we need to refer instead to the base table tablet information. The method get_tablet_map always returns the base tablet map.	2025-07-01 10:29:59 +03:00
Michael Litvak	4e2742a30b	docs: update system.tablets schema The schema of system.tablets in the docs is outdated. replace it with the current schema.	2025-07-01 10:29:59 +03:00
Dawid Mędrek	d4315e4fae	cql3/statements/describe_statement.cc: Don't copy descriptions In an upcoming commit, `cql3::description` is going to become a move-only type. These changes are a prerequisite for it: we get rid of all places in the file where we copy its instances and start moving them instead.	2025-06-30 19:12:14 +02:00
Dawid Mędrek	9472da3220	cql3: Use managed_bytes instead of bytes in DESCRIBE This is a prerequisite for a following commit. We want to move towards using non-contiguous memory chunks to avoid making large allocations. This commit does NOT change the behavior of Scylla at all. The rows corresponding to the result of a DESCRIBE statement are represented by an instance of `result_set`. Before these changes, we encoded descriptions using `bytes` and then passed them into a `result_set` using its method `add_row`. What it does is turn the instances of `bytes` into instances of `managed_bytes` and append them at the end of its internal vector. In these changes, we do it on our own and use another overload of the method.	2025-06-30 19:12:14 +02:00
Dawid Mędrek	9cc3d49233	utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream Currently, we use `managed_bytes` to represent fragmented sequences of bytes. In some cases, the type corresponds to generic bytes, while in some other cases -- to strings of actual text. Because of that, it's very easy to get confused about what purpose `managed_bytes` serves in a specific piece of code. We should avoid it. In this commit, we're introducing basic wrappers over `managed_bytes` and `bytes_ostream` with a promise that they represent UTF-8-encoded strings. The interface of those types is pretty basic, but it should be sufficient for the most common use: filling a stream with characters and then extracting a fragmented buffer from it.	2025-06-30 19:12:08 +02:00
Calle Wilund	8d37e5e24b	encryption_at_rest_test: Add exception handler to ensure proxy stop If boost test is run such that we somehow except even in a test macro such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy used, causing a use after free.	2025-06-30 11:36:38 +00:00
Calle Wilund	ee98f5d361	encryption: Ensure stopping timers in provider cache objects utils::loading cache has a timer that can, if we're unlucky, be running while the encryption context/extensions referencing the various host objects containing them are destroyed in the case of unit testing. Add a stop phase in encryption context shutdown closing the caches.	2025-06-30 11:36:38 +00:00
Anna Stuchlik	b61641cf57	doc: remove support for Ubuntu 20.04 Fixes https://github.com/scylladb/scylladb/issues/24564 Closes scylladb/scylladb#24565	2025-06-30 12:33:29 +02:00
Evgeniy Naydanov	8c981354a7	test.py: dtest: make auth_test.py run using test.py As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `required_features("!consistent-topology-changes")` marker. Remove `test_permissions_caching` test because it's too flaky when running using test.py Also, make a few execution time optimizations: - remove redundant `time.sleep(10)` - use smaller timeouts for CQL sessions Enable the test in suite.yaml (run in dev mode only)	2025-06-30 10:16:36 +00:00
Evgeniy Naydanov	e30e2345b7	test.py: dtest: rework wait_for_any_log() Make `wait_for_any_log()` function to work closer to the original dtest's version: use `ScyllaLogFile.grep()` method instead of the usage of `ScyllaNode.wait_log_for()` with a small timeout to have at least one try to find. Also, add `max_count` argument to `.grep()` method for the optimization purpose.	2025-06-30 10:16:36 +00:00
Evgeniy Naydanov	b5d44c763d	test.py: dtest: add support for bootstrap parameter for new_node Technically, `new_node()`'s `bootstrap` parameter used to mark a node as a seed if it's False. In test.py, the seeds parameter is passed on start of a node, so save it as the `ScyllaNode.bootstrap` attribute to use in the `ScyllaNode.start()` method.	2025-06-30 10:16:36 +00:00
Evgeniy Naydanov	d0d2171fa4	test.py: dtest: add generate_cluster_topology() function Copy generate_cluster_topology() function from tools/cluster_topology.py module.	2025-06-30 10:16:36 +00:00
Evgeniy Naydanov	28d9cdef1b	test.py: dtest: add ScyllaNode.set_configuration_options() method Implement the method using slightly modified `set_configuration_options()` method of `ScyllaCluster`.	2025-06-30 10:16:36 +00:00
Evgeniy Naydanov	a1ce3aed44	test.py: pylib/manager_client: support batch config changes Modify ManagerClient.server_update_config() method to change multiple config options in one call in addition to one `key: value` pair. All internal machinery converted to get a values dict as a parameter. Type hints were adjusted too.	2025-06-30 10:16:36 +00:00
Evgeniy Naydanov	ce9fc87648	test.py: dtest: copy unmodified auth_test.py	2025-06-30 10:06:32 +00:00
Evgeniy Naydanov	702409f7b2	test.py: dtest: add missed markers to pytest.ini `exclude_errors` and `cluster_options` are used in `audit_test.py`	2025-06-30 10:06:32 +00:00
Andrei Chekun	c6c3e9f492	test.py: use unique hostname for Minio To avoid the situation where the port is occupied on localhost, use a unique hostname for Minio	2025-06-30 12:03:06 +02:00
Andrei Chekun	0ca539e162	test.py: Catch possible exceptions during 3rd party services start With this change, if something goes wrong while starting the services, they will still be shut down in the finally clause. Without it, the run can hang forever	2025-06-30 12:00:23 +02:00
Petr Gusev	35aba76401	LWT: make cas_shard non-optional in sp::cas We also make sp::cas_shard function local since it's now not used directly by sp clients.	2025-06-30 10:37:33 +02:00
Petr Gusev	3d262d2be8	LWT: create cas_shard in select_statement In this commit we create cas_shard in select_statement and pass it to the sp::query_result function.	2025-06-30 10:37:33 +02:00
Petr Gusev	736fa05b17	LWT: create cas_shard in modification and batch statements We create cas_shard before the shard check to protect against concurrent tablet migrations.	2025-06-30 10:37:33 +02:00
Petr Gusev	7e64852bfd	LWT: create cas_shard in alternator We create cas_shard instance in shard_for_execute(). This implies that the decision about the correct shard was made using the specific token_metadata_guard, and it remains valid only as long as the guard is held. When forwarding a request to another shard, we keep the original cas_shard alive. This ensures that the target shard remains a valid owner for the given token. Fixes scylladb/scylladb#17399	2025-06-30 10:37:33 +02:00
Petr Gusev	deb7afbc87	LWT: use cas_shard in storage_proxy::cas Take cas_shard parameter in sp::cas and pass token_metadata_guard down to paxos_response_handler. We make cas_shard parameter optional in storage_proxy methods to make the refactoring easier. The sp::cas method constructs a new token_metadata_guard if it's not set. All call sites pass null in this commit, we will add the proper implementation in the next commits.	2025-06-30 10:33:17 +02:00
Petr Gusev	94f0717a1e	do_query_with_paxos: remove redundant cas_shard check The same check is done in the sp::cas method.	2025-06-30 10:33:17 +02:00
Petr Gusev	43c4de8ad1	storage_proxy: add cas_shard class The sp::cas method must be called on the correct shard, as determined by sp::cas_shard. Additionally, there must be no asynchronous yields between the shard check and capturing the erm strong pointer in sp::cas. While this condition currently holds, it's fragile and easy to break. To address this, future commits will move the capture of token_metadata_guard to the call sites of sp::cas, before performing the shard check. As a first step, this commit introduces a cas_shard class that wraps both the target shard and a token_metadata_guard instance. This ensures the returned shard remains valid for the given tablet as long as the guard is held. In the next commits, we’ll pass a cas_shard instance to sp::cas as a separate parameter.	2025-06-30 10:33:17 +02:00
Nadav Har'El	7db5e9a3e9	test/cqlpy: reproducer for decimal parsing with very high exponent This patch adds tests reproducing issue #24581, where Scylla incorrectly parsed "decimal"-type literals in CQL with very high exponents, near or above the 32-bit limit. For example, 1.1234e-2147483647 was incorrectly read as 1.1234E+2147483649, while it should be (as we explain in comments in the test) an error. The tests in this patch failed (in multiple checks) before #24581 was fixed, and pass after it was fixed. These tests all pass on Cassandra 3, confirming our understanding on the limits of "decimal" to be correct. But they fail on Cassandra 4 and 5 due to a regression https://issues.apache.org/jira/browse/CASSANDRA-20723 in Cassandra, that mistakenly limited "decimal" exponents to just 309. Refs #24581 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24646	2025-06-30 10:37:13 +03:00
Anna Stuchlik	b7683d0eba	doc: remove duplicated content This commit removes the Non-Reserved CQL Keywords and Reserved CQL Keywords pages-keyword as that content is already covered on the Appendices page. Redirections are added to avoid 404s for the removed pages. In addition, the Appendices page title is extended with "Reserved CQL Keywords and Types" to help users understand what those appendices are about. Fixes https://github.com/scylladb/scylladb/issues/24319 Closes scylladb/scylladb#24320	2025-06-30 10:30:13 +03:00
Andrzej Jackowski	c8ab5928a3	test: audit: synchronize audit syslog server In audit tests, UnixDatagramServer is used to receive audit logs. This commit introduces a synchronization between the logs receiver and a function that reads already received logs. Without this, there was a race condition that resulted in test failures (e.g., audit logs were missing during assertion check).	2025-06-30 09:19:26 +02:00
Andrzej Jackowski	fcd88e1e54	docs: audit: update syslog audit format to the current one The documentation of the syslog audit format was not updated when scylladb#23099 and earlier audit log changes were introduced. This commit includes the missing update.	2025-06-30 09:19:25 +02:00
Andrzej Jackowski	422b81018d	audit: bring back commas to audit syslog When the audit syslog format was changed in scylladb#23099, commas were removed. This made the syslog format inconsistent, as LOGIN audit logs contained commas while other audit logs did not. Additionally, the lack of commas was not aligned with the audit documentation. This commit brings back the use of commas in the audit syslog format to ensure consistency across all types of audit logs. Fixes: scylladb#24410	2025-06-30 09:19:25 +02:00
Botond Dénes	ee6d7c6ad9	test/boost/memtable_test: only inject error for test table Currently the test indiscriminately injects failures into the flushes of any table, via the IO extension mechanism. The tests want to check that the node correctly handles the IO error by self isolating, however the indiscriminate IO errors can have unintended consequences when they hit raft, leading to disorderly shutdown and failure of the tests. Testing raft's resiliency to IO errors is of course worth doing, but it is not the goal of this particular test, so to avoid the fallout, the IO errors are limited to the test tables only. Fixes: https://github.com/scylladb/scylladb/issues/24637 Closes scylladb/scylladb#24638	2025-06-30 10:08:49 +03:00
Avi Kivity	07c5edcc30	tools: add patchelf utility We use patchelf to rewrite the dynamic loader (known as the interpreter) of the binaries we ship, so we can point to our shipped dynamic loader, which is compatible with our binaries, rather than rely on the distribution's dynamic loader, which is likely to be incompatible. Upstream patchelf is losing compatibility [1] with Linux 5.17 and below. This change was also picked up by Fedora 42, so we cannot update the toolchain to that distribution until we have an alternative. Here we add a minimal patchelf alternative. It was mostly written by Claude. It is minimal in that it only supports --set-interpreter and --print-interpreter, and works well enough for our needs. We still use the original patchelf for --remove-rpath; this reduces our maintenance needs. [1] `43b75fbc9f` [2] `4b015255d1` Closes scylladb/scylladb#24695	2025-06-30 07:24:05 +03:00
Avi Kivity	e2cda38b0f	Merge 'alternator: improve, document and test table/index name lengths' from Nadav Har'El Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations: 1. A table name must be up to 222 characters. 2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters. 3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters. The first patch documents these existing limitations, improves their testing, and fixes a tiny bug found by one of the tests (where UpdateTable adding a GSI's limit testing is off by one). The second patch unfortunately shows with a reproducer (issue #24598) this limit of 222 is problematic and we may need to lower it: If a user creates a table of length 222 and then enables Alternator streams, Scylla shuts down on an IO error. This will need to be fixed later, but at least this patch properly documents the existing behavior. No need to backport this patch - it is a very minor improvement that it is unlikely users care about and there is no potential for harm. Closes scylladb/scylladb#24597 * github.com:scylladb/scylladb: test/alternator: reproducer for streams bug with long table name alternator: improve, document and test table/index name lengths	2025-06-29 18:53:48 +03:00
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions, has to be backported to all supported versions. 
Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
Avi Kivity	48d9f3d2e3	Merge 'mutation: check key of inserted rows' from Botond Dénes Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. Fixes: https://github.com/scylladb/scylladb/issues/24506 Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters. Closes scylladb/scylladb#24497 * github.com:scylladb/scylladb: mutation: check key of inserted rows compound: optimize is_full() for single-component types	2025-06-29 18:10:17 +03:00
Pavel Emelyanov	ef396ecf7a	api: Reserve resulting vector with schema versions The get_schema_versions handler gets unordered_map from storage service, then converts it to API returning type, which is a vector. This vector can be reserved, the final number of elements is known in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24715	2025-06-29 14:37:45 +03:00
Nadav Har'El	e7257b1393	test/alternator: make "run" script use only_rmw_uses_lwt Originally (since commit `c3da9f2`), Alternator's functional test suite (test/alternator) ran "always_use_lwt" write isolation mode. The original thinking was that we need to exercise this more difficult mode and it's the most important mode. This mode was originally chosen in test/alternator/run. However, starting with commit `76a766c` (a year ago), test.py no longer runs test/alternator/run. Instead, it runs Scylla itself, and the options for running Scylla appear in test/alternator/suite.yaml, and accidentally the write isolation mode only_rmw_uses_lwt was chosen there. The purpose of this patch is to reconcile this difference and use the same mode in test.py (which CI is using) and test/alternator/run (which is only used by some developers, during development). I decided to have this patch change test/alternator/run to use only_rmw_uses_lwt. As noted above, this is anyway how all Alternator tests have been running in CI in the past year (through test.py). Also, the mode only_rmw_uses_lwt makes running the Alternator test suite slightly faster (52 seconds instead of 58 seconds, on my laptop) which is always nice for developers. This patch changes nothing for testing in CI - only manual runs through test/alternator/run are affected. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-29 13:58:58 +03:00
Nadav Har'El	3fd2493bc9	test/alternator: improve tests for write isolation modes Before this patch, we had in test_condition_expression.py and test_update_expression.py some rudimentary tests that the different write isolation modes behave as expected. Basically, we wanted to test that read-modify-write (RMW) operations are recognized and forbidden in forbid_rmw mode, but work correctly in the three other modes. We only check non-concurrent writes, so the actual write isolation is NOT checked, just the correctness of non-concurrent writes. However, since these tests were split across several files, and many of the tests just ran other existing tests in different write isolation modes, it was hard to see what exactly was being tested, and what was missed. And indeed we missed checking some RMW operations, such as requests with ReturnValues, requests with the older Expected or AttributeUpdates (only the newer ConditionExpression and UpdateExpression were tested), and ADD and DELETE operations in UpdateExpression. So this patch replaces the existing partial tests with a new test file test_write_isolation.py dedicated to testing all kinds of RMW operations in one place, and how they don't work in forbid_rmw and do work in the other modes. Writing all these tests in one place made it easier to create a really exhaustive test of all the different operations and optional parameters, and conversely - make sure that we don't test unnecessary things such as different ConditionExpression expressions (we already have 1800 lines of tests for ConditionExpression, and the actual content of the condition is unrelated to write isolation modes). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-29 13:58:38 +03:00
Nadav Har'El	50d370f06e	test/alternator: reproducer for streams bug with long table name The two tests in this patch reproduce issue #24598: When enabling Alternator streams on an Alternator table with a very long name, such as the maximum allowed name length 222, the result is an I/O error and a Scylla shutdown. The two tests are currently marked "skip", otherwise they would crash the Scylla being tested. Refs #24598 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-29 11:40:55 +03:00
Nadav Har'El	0ce0b2934f	alternator: improve, document and test table/index name lengths Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations: 1. A table name must be up to 222 characters. 2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters. 3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters. These specific limitations were never documented, so in this patch we add this information to docs/alternator/compatibility.md. Moreover, these limitations were only partially tested, so in this patch we add testing for more cases that we forgot to check - such as length of LSI names (only GSI were checked before this patch), or adding a GSI to an existing table. It is important to check all these corner cases because there is a risk that if we attempt to create a table without checking its length, we can end up with an I/O error that brings down Scylla. In one case - UpdateTable adding a GSI to an existing table - the new test exposed a trivial bug: Because UpdateTable wants to verify the new GSI doesn't have the same name as an existing LSI, it mistakenly applied the LSI's name length limit instead of the GSI's name length limit, which is one byte less than it should be. So this patch fixes this trivial bug as well. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-29 11:40:55 +03:00
Emil Maskovsky	c6307aafd5	test.py: handle cancellation gracefully to avoid TypeError Previously, if test execution was cancelled, `run_all_tests()` could return `None`. This caused a `TypeError` when the result was unconditionally unpacked into `total_tests_pytest, failed_pytest_tests`. This commit updates the code to handle the cancellation appropriately, preventing the confusing `TypeError` exception and ensuring clean cancellation behavior. Closes scylladb/scylladb#24624	2025-06-27 20:14:35 +03:00
Pavel Emelyanov	23d86ede72	Merge 'audit: introduce debug level logs on happy path' from Dario Mirovic Audit component defines `audit` logger which it uses only for `error` and `info` logs, regarding `audit` module initialization and errors during audit log writing. This change introduces `debug` level logs on the happy path of audit log writes. Fixes: https://github.com/scylladb/scylladb/issues/23773 No backport needed - this is a small quality-of-life improvement. Closes scylladb/scylladb#24658 * github.com:scylladb/scylladb: audit: change audit test logger level to `debug` audit: introduce debug level logs on happy path	2025-06-27 20:10:54 +03:00
Anna Stuchlik	2367330513	doc: remove OSS mention from the SI notes This commit removes a confusing reference to an Open Source version from the Local Secondary Indexes page. Fixes https://github.com/scylladb/scylladb/issues/24668 Closes scylladb/scylladb#24673	2025-06-27 20:07:51 +03:00
Anna Stuchlik	7537f5f260	doc: fix the headings in the Admin Guide This commit fixes incorrect headings in the Admin Guide and the files that are included in that guide. The purpose is to properly organize the content and improve the search, as well as prevent potential build problems caused by a poor heading organization. Fixes https://github.com/scylladb/scylladb/issues/24441 Closes scylladb/scylladb#24700	2025-06-27 20:07:09 +03:00
Dario Mirovic	ec6249b581	audit: change audit test logger level to `debug` Audit module tests should show the `debug` level messages. This change makes audit_test.py `audit` module log level to `debug`. Closes scylladb/scylladb#23773	2025-06-27 16:27:33 +02:00
Dario Mirovic	666364f651	audit: introduce debug level logs on happy path Audit component defines `audit` logger which it uses only for `error` and `info` logs, regarding `audit` module initialization and errors during audit log writing. This change introduces `debug` level logs on the happy path of audit log writes. Ref: scylladb/scylladb#23773	2025-06-27 16:27:27 +02:00
Botond Dénes	495f607e73	test/cluster/test_read_repair: write 100 rows in trace test This test asserts that a read repair really happened. To ensure this happens it writes a single partition after enabling the database_apply error injection point. For some reason, the write is sometimes reordered with the error injection and the write will get replicated to both nodes and no read repair will happen, failing the test. To make the test less sensitive to such rare reordering, add a clustering column to the table and write 100 rows. The chance of all 100 of them being reordered with the error injection should be low enough that it doesn't happen again (famous last words). Fixes: #24330 Closes scylladb/scylladb#24403	2025-06-27 16:23:08 +03:00
Pavel Emelyanov	4c0154f156	Merge 'test.py: enhance allure reporting' from Andrei Chekun Add a run ID to the process output file so that it is not overwritten in the following case: the first run failed, the second passed. Both runs use the same name, so the second run would overwrite and delete the file. This will help to investigate C++ test failures. Attach Scylla log files to the allure report when a test fails. This is an alternative to the link in the JUnit report that exists in CI. That change will help to investigate cluster test failures. An example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/). Backport is not needed, this is only a framework enhancement Closes scylladb/scylladb#24677 * github.com:scylladb/scylladb: test.py: Attach node logs in allure report in case of fail test.py: Add run id to the boost output file	2025-06-27 16:22:03 +03:00
Botond Dénes	e715a150b9	tools/scylla-nodetool: backup: add --move-files parameter Allow opting in for backup to move the files instead of copying them. Fixes: https://github.com/scylladb/scylladb/issues/24372 Closes scylladb/scylladb#24503	2025-06-27 16:21:39 +03:00
Piotr Dulikowski	9d70e7a067	Merge 'docs: document the new recovery procedure' from Patryk Jędrzejczak We replace the documentation of the old recovery procedure with the documentation of the new recovery procedure. The new recovery procedure requires the Raft-based topology to be enabled, so to remove the old procedure from the documentation, we must assume users have the Raft-based topology enabled. We can do it in 2025.2 because the upgrade guides to 2025.1 state that enabling the Raft-based topology is a mandatory step of the upgrade. Another reminder is the upgrade guides to 2025.2. Since we rely on the Raft-based topology being enabled, we remove the obsolete parts of the documentation. We will make the Raft-based topology mandatory in the code in the future, hopefully in 2025.3. For this reason, we also don't touch the dev docs in this PR. Fixes scylladb/scylladb#24530 Requires backport to 2025.2 because 2025.2 contains the new recovery procedure. Closes scylladb/scylladb#24583 * github.com:scylladb/scylladb: docs: rely on the Raft-based topology being enabled docs: handling-node-failures: document the new recovery procedure	2025-06-26 17:07:37 +02:00
Gleb Natapov	5f953eb092	storage_proxy: retry paxos repair even if repair write succeeded After paxos state is repaired in begin_and_repair_paxos we need to re-check the state regardless if write back succeeded or not. This is how the code worked originally but it was unintentionally changed when co-routinized in `61b2e41a23`. Fixes #24630 Closes scylladb/scylladb#24651	2025-06-26 17:06:02 +02:00
Andrei Chekun	2c726c5074	test.py: Attach node logs in allure report in case of fail Currently, the allure report has no node logs in case of failure. This change makes it possible to view the logs in one place without going anywhere else.	2025-06-26 15:37:33 +02:00
Piotr Dulikowski	2f7ed8b1d4	Merge 'Fix for cassandra role gets recreated after DROP ROLE' from Marcin Maliszkiewicz This patchset fixes regression introduced by `7e749cd848` when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user. Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm it with CL QUORUM and only then atomically create role or password. If server is started without cluster quorum we'll skip creating role or password. Fixes https://github.com/scylladb/scylladb/issues/24469 Backport: all versions since 2024.2 Closes scylladb/scylladb#24451 * github.com:scylladb/scylladb: test: auth_cluster: add test for password reset procedure auth: cache roles table scan during startup test: auth_cluster: add test for replacing default superuser test: pylib: add ability to specify default authenticator during server_start test: pylib: allow rolling restart without waiting for cql auth: split auth-v2 logic for adding default superuser password auth: split auth-v2 logic for adding default superuser role auth: ldap: fix waiting for underlying role manager auth: wait for default role creation before starting authorizer and authenticator	2025-06-26 14:36:25 +02:00
Lakshmi Narayanan Sreethar	279253ffd0	utils/big_decimal: fix scale overflow when parsing values with large exponents The exponent of a big decimal string is parsed as an int32, adjusted for the removed fractional part, and stored as an int32. When parsing values like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32 limit, and since the scale is stored as an int32, it overflows and wraps around, losing the value. This patch fixes that by parsing the exponent as an int64 value and then adjusting it for the fractional part. The adjusted scale is then checked to see if it is still within int32 limits before storing. An exception is thrown if it is not within the int32 limits. Note that strings with exponents that exceed the int32 range, like `0.01E2147483650`, were previously not parseable as a big decimal. They are now accepted if the final adjusted scale fits within int32 limits. For the above value, unscaled_value = 1 and scale = -2147483648, so it is now accepted. This is in line with how Java's `BigDecimal` parses strings. Fixes: #24581 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#24640	2025-06-26 15:29:28 +03:00
Patryk Jędrzejczak	203ea5d8f9	docs: rely on the Raft-based topology being enabled In 2025.2, we don't force enabling the Raft-based topology in the code, but we stated in the upgrade guides that it's a mandatory step of the upgrade to 2025.1. We also remind users to enable the Raft-based topology in the upgrade guides to 2025.2. Hence, we can rely in the documentation on the Raft-based topology being enabled. If it is still disabled, we can just send the user to the upgrade guides. Hence: - we remove all documentation related to enabling the Raft-based topology, enabling the Raft-based schema (enabled Raft-based topology implies enabled Raft-based schema), and the gossip-based topology, - we can replace the documentation of the old manual recovery procedure with the documentation of the new manual recovery procedure (done in the previous commit).	2025-06-26 14:17:54 +02:00
Patryk Jędrzejczak	4e256182a0	docs: handling-node-failures: document the new recovery procedure We replace the documentation of the old recovery procedure with the documentation of the new recovery procedure. We can get rid of the old procedure from the documentation because we requested users to enable the Raft-based topology during upgrades to 2025.1 and 2025.2. We leave the note that enabling the Raft-based topology is required to use the new recovery procedure just in case, since we didn't force enabling the Raft-based topology in the code.	2025-06-26 14:17:50 +02:00
Andrei Chekun	156e7d2e7a	test.py: Add run id to the boost output file To avoid overwriting the test output, add the run id to it. Previously, when the first repeat failed and the second passed, the output was overwritten and deleted, because both repeats used the same output name and the second repeat passed	2025-06-26 12:51:15 +02:00
Marcin Maliszkiewicz	5e7ac34822	test: auth_cluster: add test for password reset procedure	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	0ffddce636	auth: cache roles table scan during startup It may be particularly beneficial during connection storms on startup. In such cases, it can happen that none of the user's read requests succeed, preventing the cache from being populated. This, in turn, makes it more difficult for subsequent reads to succeed, reducing resiliency against such storms.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	67a4bfc152	test: auth_cluster: add test for replacing default superuser This test demonstrates creating custom superuser guide: https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	a3bb679f49	test: pylib: add ability to specify default authenticator during server_start Sometimes we may not want to use default cassandra role for control connection, especially when we test dropping default role.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	d9ec746c6d	test: pylib: allow rolling restart without waiting for cql Waiting for CQL requires default superuser being present in db. In some cases we may delete it and still want to do rolling restart. Additionally if we need CQL we may want to wait after restart is complete (once, and not for each node).	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	f85d73d405	auth: split auth-v2 logic for adding default superuser password In raft mode (auth-v2) we need to do atomic write after read as we give stricter consistency guarantees. Instead of patching legacy logic this commit adds different path as: - old code may be less tested now so it's best to not change it - new code path avoids quorum selects in a typical flow (passwords set) There may be a case when user deletes a superuser or password right before restarting a node, in such case we may omit updating a password but: - this is a trade-off between quorum reads on startup - it's far more important to not update password when it shouldn't be - if needed password will be updated on next node restart If there is no quorum on startup we'll skip creating password because we can't perform any raft operation. Additionally this fixes a problem when password is created despite having non default superuser in auth-v2.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	2e2ba84e94	auth: split auth-v2 logic for adding default superuser role In raft mode (auth-v2) we need to do atomic write after read as we give stricter consistency guarantees. Instead of patching legacy logic this commit adds different path as: - old code may be less tested now so it's best to not change it - new code path avoids quorum selects in a typical flow (roles set) This fixes a problem when superuser role is created despite having non default superuser in auth-v2. If there is no quorum on startup we'll skip creating role because we can't perform any raft operation.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	c96c5bfef5	auth: ldap: fix waiting for underlying role manager ldap_role_manager depends on standard_role_manager, therefore it needs to wait for superuser initialization. If this is missing, the password authenticator will start checking the default password too early and may fail to create the default password if there is no default role yet. Currently password authenticator will create password together with the role in such case but in following commits we want to separate those responsibilities correctly.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	68fc4c6d61	auth: wait for default role creation before starting authorizer and authenticator There is a hidden dependency: the creation of the default superuser role is split between the password authenticator and the role manager. To work correctly, they must start in the right order: role manager first, then password authenticator.	2025-06-26 12:28:08 +02:00
Piotr Dulikowski	62efe6616a	Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is, a <host, shard> pair) processes two tablets at a time. The new algorithm can be summarized as follows: 1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space. 2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each partition range (in such cases, the partition range is assigned to the appropriate tablet_replica). In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host. In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch. In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. 
This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range. Additionally, shard_id is added to the mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensures that the balance is preserved, even during events such as tablet splits. This patch series: - Refactors the mapreduce service a bit, to facilitate having two algorithm versions (one for vnodes and one for tablets). - Implements tablet-aware dispatching algorithm. - Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard. - Adds test_long_query_timeout_erm to verify the new functionality. Fixes: scylladb#21831 No backport, as it is a new feature rather than a bugfix. Closes scylladb/scylladb#24383 * github.com:scylladb/scylladb: mapreduce: add missing comma and space in mapreduce_request operator<< mapreduce: add shard_id_hint to mapreduce request test: add test_long_query_timeout_erm mapreduce: add tablet-aware dispatching algorithm storage_proxy: make storage_proxy::is_alive public mapreduce: remove _shared_token_metadata from mapreduce_service mapreduce: move dispatching logic to dispatch_to_vnodes mapreduce: remove underscores from variable names mapreduce: move req_with_modified_pr handling to a new function mapreduce: change next_vnode lambda to get_next_partition_range function	2025-06-26 12:25:39 +02:00
Avi Kivity	947906e6fd	Merge 'Make uuid sstable generations mandatory' from Benny Halevy Before we can eradicate the numerical sstable generations, This series completes https://github.com/scylladb/scylladb/issues/20337 by disabling the use of numerical sstable generations where we can and making sure the feature is never disabled. Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables. Refs #24248 * Enhancement. No backport required Closes scylladb/scylladb#24554 * github.com:scylladb/scylladb: feature_service: never disable UUID_SSTABLE_IDENTIFIERS test: sstable_move_test: always use uuid sstable generation test: sstable_directory_test: always use uuid sstable generation sstables: sstable_generation_generator: set last_generation=0 by default test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation test: lib: test_env: always use uuid sstable generation test: sstable_test: always use uuid sstable generation test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation test: sstable_compaction_test: always use uuid sstable generation	2025-06-26 12:25:38 +02:00
Szymon Malewski	f28bab741d	utils/exceptions.cc: Added check for `exceptions::request_timeout_exception` in `is_timeout_exception` function. It solves the issue where, in some cases, timeout exceptions in CAS operations are logged incorrectly as a general failure. Fixes #24591 Closes scylladb/scylladb#24619	2025-06-26 12:25:38 +02:00
Pavel Emelyanov	0f5b358c47	test: Use test sched groups, not database ones Some tests want to switch between sched groups. For that there's cql-test-env facility to create and use them. However, there's a test that uses replica::database as sched groups provider, which is not nice. Fix it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24615	2025-06-26 12:25:38 +02:00
Avi Kivity	ff508ce82c	Merge 'sstables: purge SCYLLA_ASSERT from the sstable read/parse paths' from Botond Dénes Introduce `sstables::parse_assert()`, to replace `SCYLLA_ASSERT()` on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database, the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced `parse_assert()` uses `on_internal_error()` under the hood, which prints a backtrace and optionally allows for aborting on the error, to generate a coredump. Fixes https://github.com/scylladb/scylladb/issues/20845 We just hit another case of `SCYLLA_ASSERT()` triggering due to corrupt sstables bringing down nodes in the field, should be backported to all releases, so we don't hit this in the future Closes scylladb/scylladb#24534 * github.com:scylladb/scylladb: sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path sstables/exceptions: introduce parse_assert()	2025-06-26 12:25:38 +02:00
Ferenc Szili	96267960f8	logging: Add row count to large partition warning message When writing large partitions, that is: partitions with size or row count above a configurable threshold, ScyllaDB outputs a warning to the log: WARN ... large_data - Writing large partition test/test: (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db This warning contains the information about the size of the partition, but it does not contain the number of rows written. This can lead to confusion because in cases where the warning was written because of the row count being larger than the threshold, but the partition size is below the threshold, the warning will only contain the partition size in bytes, leading the user to believe the warning was output because of the partition size, when in reality it was the row count that triggered the warning. See #20125 This change adds a size_desc argument to cql_table_large_data_handler::try_record(), which will contain the description of the size of the object written. This method is used to output warnings for large partitions, row counts, row sizes and cell sizes. This change does not modify the warning message for row and cell sizes, only for partition size and row count. The warning for large partitions and row counts will now look like this: WARN ... large_data - Writing large partition test/test: (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db Closes scylladb/scylladb#22010	2025-06-26 12:25:38 +02:00
Yaniv Michael Kaul	198ecd8039	Do not perform blkdiscard by default on the disks during RAID setup. This is not needed on clean disks, which is often the case with cloud instances, but can be useful on bare metal servers with disks that were used before. Therefore, the default is to skip blkdiscard operation, which makes overall installation faster. If the user wishes to run it anyway, use the newly introduced --blkdiscard option of scylla_raid_setup to perform it. Note: since we either perform online discard or schedule fstrim, the (previously used) space will gradually get trimmed, this way or another. Fixes: https://github.com/scylladb/scylladb/issues/24470 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#24579	2025-06-26 12:25:38 +02:00
Piotr Dulikowski	23f0d275c8	Merge 'generic_server: fix connections semaphore config observer' from Marcin Maliszkiewicz In `ed3e4f33fd` we introduced new connection throttling feature which is controlled by uninitialized_connections_semaphore_cpu_concurrency config. But live updating of it was broken, this patch fixes it. When the temporary value from observer() is destroyed, it disconnects from updateable_value, so observation stops right away. We need to retain the observer. Backport: to 2025.2 where this feature was added Fixes: https://github.com/scylladb/scylladb/issues/24557 Closes scylladb/scylladb#24484 * github.com:scylladb/scylladb: test: add test for live updates of generic server config utils: don't allow do discard updateable_value observer generic_server: fix connections semaphore config observer	2025-06-26 12:25:38 +02:00
Sayanta Banerjee	fa1eafa166	Small grammatical changes	2025-06-26 15:52:38 +05:30
Andrzej Jackowski	ba6ed45d7f	mapreduce: add missing comma and space in mapreduce_request operator<< This change is introduced to fix the broken formatting of mapreduce_request `operator<<`. Due to lack of ", " before "cmd" the output was `reductions=[...]cmd=read_command{...}` instead of `reductions=[...], cmd=read_command{...}`.	2025-06-25 19:23:07 +02:00
Andrzej Jackowski	26403df9ea	mapreduce: add shard_id_hint to mapreduce request If a partition range is not present locally, `partition_ranges_owned_by_this_shard` assigns it to shard 0, which can overload shard 0. To address this, this commit adds a `shard_id_hint` to the mapreduce request. When `shard_id_hint` is set, the entire partition range in the request is handled by the specified shard. The `shard_id_hint` is set by the new tablet-aware mapreduce algorithm, introduced in `dispatch_to_tablets`. This algorithm balances the workload across shards, so the changes in this commit ensure that load balancing is preserved, even during events such as tablet splits. Fixes: scylladb#21831	2025-06-25 19:23:07 +02:00
Andrzej Jackowski	5f31011111	test: add test_long_query_timeout_erm This test verifies the effectiveness of the mechanism for releasing ERM introduced in this patch series. In test scenario, during processing of a query in mapreduce service, reads are intentionally blocked by an injected error. However, when table uses tablets, ERM is now often released by the mapreduce service, so the topology is not blocked to the end of the request. As a result, it is possible to add a new node before the query finishes. Refs. scylladb#21831	2025-06-25 19:22:48 +02:00
Robert Bindar	6e7cab5b45	Add repository layout dev documentation This change adds an md file which gives a high level overview of the scylladb repository, the components each path contains and a basic description for each one of them. This is mainly intended for onboarding engineers to help get a mental picture when starting ramping up on Scylla concepts. Refs #22908 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23010	2025-06-25 13:58:05 +03:00
Patryk Jędrzejczak	cc8c618356	Merge 'LWT for tablets: fix paxos state for intranode migration' from Petr Gusev This PR fixes the "intra-node tablet migration" issue from the [LWT over tablets spec](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.uk3mizf7gvs1). We make `get_replica_lock` to acquire locks on both shards to avoid races. We also implement read_repair for paxos state -- if `load_paxos_state` returns different states on two shards, we 'repair' it by choosing the values with maximum timestamp and writing the 'repaired' state to both shards. LWT for tablets is not enabled yet. It requires migrating paxos state to colocated tablets, which is blocked on [this PR](https://github.com/scylladb/scylladb/pull/22906). Regarding testing: * We could possibly arrange a test case for the locking commit through some error injection magic. We'll return to this when LWT for tablets is enabled. * We can't think of a clear test case for the read_repair commit. Any suggestions are welcome (@gleb-cloudius). Backport: no need, since it's a new feature. Closes scylladb/scylladb#24478 * https://github.com/scylladb/scylladb: paxos_state: read repair for intranode_migration paxos_state: fix get_replica_lock for intranode_migration	2025-06-25 11:08:39 +02:00
Sergey Zolotukhin	0d7de90523	Fix regexp in `check_node_log_for_failed_mutations` The regexp that was added in https://github.com/scylladb/scylladb/pull/23658 does not work as expected: `TRACE`, `INFO` and `DEBUG` level messages are not ignored. This patch corrects the pattern to ensure those log levels are excluded. Fixes scylladb/scylladb#23688 Closes scylladb/scylladb#23889	2025-06-25 12:00:16 +03:00
Anna Stuchlik	592d45a156	doc: remove references to Open Source from README This commit removes the references to ScyllaDB Open Source from the README file for documentation. In addition, it updates the link where the documentation is currently published. We've removed Open Source from all the documentation, but the README was missed. This commit fixes that. Closes scylladb/scylladb#24477	2025-06-25 11:38:46 +03:00
Michał Chojnowski	cace55aaaf	test_sstable_compression_dictionaries_basic.py: fix a flaky check test_dict_memory_limit trains new dictionaries and checks (via metrics) that the old dictionaries are appropriately cleaned up. The problem is that the cleanup is asynchronous (because the lifetimes are handled by foreign_ptr, which sends the destructor call to the owner shard asynchronously), so the metrics might be checked a few milliseconds before the old dictionary is cleaned up. The dict lifetimes are lazy on purpose, the right thing to do is to just let the test retry the check. Fixes scylladb/scylladb#24516 Closes scylladb/scylladb#24526	2025-06-25 11:30:28 +03:00
Amnon Heiman	51cf2c2730	api/failure_detector.cc: stream endpoints Previously, get_all_endpoint_states accumulated all results in memory, which could lead to large allocations when dealing with many endpoints. This change uses the stream_range_as_array helper to stream the results. Fixes #24386 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#24405	2025-06-25 11:28:37 +03:00
Guy Shtub	71ba1f8bc9	docs: update third party driver list with Exandra Elixir driver Closes scylladb/scylladb#24260	2025-06-25 11:27:03 +03:00
Kefu Chai	e212b1af0c	build: add p11-kit's cflags to user_cflags instead of args.user_cflags Fix an issue introduced in commit `083f7353` where p11-kit's compiler flags were incorrectly added to `args.user_cflags` instead of `user_cflags`. This created the following problem: When using CMake generation mode, these flags were added to `CMAKE_CXX_FLAGS`, causing them to be passed to all compiler invocations including linking stages where they were irrelevant. This change moves p11-kit's cflags to `user_cflags`, which ensures the flags are correctly included in compilation commands but not in linking commands. This maintains the proper behavior in the ninja build system while fixing the issue in the CMake build system. `args.user_cflags` is preserved for its intended purpose of storing user-specified compiler flags passed via command line options. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23988	2025-06-25 11:24:09 +03:00
Andrzej Jackowski	ea2bdae45a	mapreduce: add tablet-aware dispatching algorithm The primary goal of this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB transitions towards tablets, which simplify work dispatching, the new algorithm is designed specifically for tablets. The algorithm divides work so that each `tablet_replica` (a <host, shard> pair) processes two tablets at a time. After processing of each `tablet_replica`, the ERM is released and re-acquired. The new algorithm can be summarized as follows: 1. Prepare a set of exclusive `partition_ranges`, where each range represents one tablet. This set is called `ranges_left`, because it contains ranges that still need processing. 2. Loop until `ranges_left` is empty: I. Create `tablet_replica` -> `ranges` mapping for the current ERM and `ranges_left`. Store this mapping and the number representing current ERM version as `ranges_per_replica`. II. In parallel, for each tablet_replica, iterate through ranges_per_tablet_replica. Select independently up to two ranges that are still present in ranges_left. Remove each range selected for processing from ranges_left. Before each iteration, verify that ERM version has not changed. If it has, return to Step I. Steps I and II are exclusive to simplify maintaining `ranges_left` and `ranges_per_replica`: - Step I iterates through `ranges_left` and creates `ranges_per_replica` - Step II iterates through `ranges_per_replica` and removes processed ranges from `ranges_left` To maintain the exclusivity, the algorithm uses `parallel_for_each` in Step II, requiring all ongoing `tablet_replica` processing to finish before returning to Step I. Currently, each node can handle any partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. 
This is because `execute_on_this_shard` creates a new pager to coordinate the partition range read, including obtaining its own ERM. However, absent ranges are handled by shard 0, so proper routing is necessary to avoid overloading shard 0. Thus, in Step II, the ERM is retained during each `tablet_replica` processing. The tablet split scenario is not well-handled in this implementation. After a split, the entire pre-split range is sent to a node hosting the `tablet_replica` containing the range's `end_token`. The node will typically not have other tablets in the range, and as aforementioned, absent ranges are handled by shard 0. As a result, in such scenario, shard 0 handles a significant portion of the range. This issue is addressed later in this patch series by introducing `shard_id` in `mapreduce_request`. Ref. scylladb#21831	2025-06-25 10:18:02 +02:00
Kefu Chai	7d4dc12741	build: cmake: Use LINKER: prefix for consistent linker option handling Previously, we passed dynamic linker options like "-dynamic-linker=..." directly to the compiler driver with padded paths. This approach created inconsistency with the build commands generated by `configure.py`. This change implements a more consistent approach by: - Using the CMake "LINKER:" prefix to mark options that should be passed directly to the linker - Ensuring Clang properly receives these options via the `-Xlinker` flag The result is improved consistency between CMake-generated build commands and those created by `configure.py`, making the build system more maintainable and predictable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23987	2025-06-25 11:17:15 +03:00
Nadav Har'El	16c1365332	test,alternator: test server-side load balancing with zero-token node In issue #6527 it was suggested that a zero-token node (a.k.a. coordinator-only node, or data-less node) could serve as a topology-aware Alternator load balancer - requests could be sent to it and they will be forwarded to the right node. This feature was implemented, but we never tested that it actually works for Alternator requests. So this patch tests this by starting a 5-node cluster with 4 regular nodes and one zero-token node, and testing that requests to the zero-token node work as expected. It is important to know that this feature does indeed work as expected, and also to have a regression test for it so the feature doesn't break in the future. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23114	2025-06-25 11:13:15 +03:00
Pablo Idiaquez	8137f34424	docs: troubleshooting/report-scylla-problem.rst: fix upload URL wrong url / hostname pointing to deprecated S3 bucket (we use GCP bucket now for uploads ) Fixes scylladb/scylladb#24639 Closes scylladb/scylladb#23533	2025-06-25 10:32:37 +03:00
Andrzej Jackowski	6d358cd7b2	storage_proxy: make storage_proxy::is_alive public The motivation is to allow other components (specifically mapreduce service) to use the method, just as storage_proxy::get_live_endpoints.	2025-06-25 08:59:04 +02:00
Andrzej Jackowski	9dbb1468b4	mapreduce: remove _shared_token_metadata from mapreduce_service Before this change, `mapreduce_service` used `_shared_token_metadata` to get the topology. However, the token was used in a part of the code that already had its own ERM with its own metadata token. Moreover, as mapreduce_service's token and ERM's token are not guaranteed to be the same, inconsistencies could occur. Therefore, this commit removes `_shared_token_metadata` and its usage.	2025-06-25 08:42:16 +02:00
Andrzej Jackowski	94ce5a0ed6	mapreduce: move dispatching logic to dispatch_to_vnodes This commit moves the current dispatching logic of the mapreduce service to a new dispatch_to_vnodes function. The moved code was written before tablets were introduced, and although it works with tablets, the variable naming still refers to vnodes (e.g., vnodes_per_addr, vnodes_generator). The motivation for this change is that later in this patch series, a new algorithm for tablets is introduced, and both algorithms need to coexist. Ref. scylladb#21831	2025-06-25 08:42:03 +02:00
Andrzej Jackowski	48aced87f5	mapreduce: remove underscores from variable names This commit removes unnecessary underscores from tr_state_ and dispatcher_ variable names, that were left after moving code to a separate function in the previous commit.	2025-06-25 08:41:21 +02:00
Andrzej Jackowski	d238a2f73e	mapreduce: move req_with_modified_pr handling to a new function The motivation for this change is to enable code reuse when a new implementation of the mapreduce algorithm for tablets is introduced later in this patch series. Ref. scylladb#21831	2025-06-25 08:40:02 +02:00
Aleksandra Martyniuk	0deb9209a0	test: rest_api: fix test_repair_task_progress test_repair_task_progress checks the progress of children of root repair task. However, nothing ensures that the children are already created. Wait until at least one child of a root repair task is created. Fixes: #24556. Closes scylladb/scylladb#24560	2025-06-25 09:08:06 +03:00
Botond Dénes	edc2906892	test/boost/sstable_datafile_test: add test for corrupt data * create a table with random schema * generate data: random mutations + one row with bad key * write data to sstable * check that only good data is written to sstable * check that the bad data was saved to system.corrupt_data	2025-06-25 08:41:29 +03:00
Botond Dénes	592ca789e2	sstables/mx/writer: handler rows with empty keys Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key, resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Use the recently introduced corrupt_data_handler to handle rows with such corrupt keys. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.	2025-06-25 08:41:29 +03:00
Botond Dénes	aae212a87c	test/lib/cql_assertions: introduce columns_assertions To enable targeted and optionally typed assertions against individual columns in a row.	2025-06-25 08:41:29 +03:00
Botond Dénes	ebd9420687	sstables: add corrupt_data_handler to sstables::sstables Similar to how large_data_handler is handled, propagate through sstables::sstables_manager and store its owner: replica::database. Tests and tools are also patched. Mostly mechanical changes, updating constructors and patching callers.	2025-06-25 08:41:26 +03:00
Botond Dénes	46ff7f9c12	tools/scylla-sstable: make large_data_handler a local No reason for it to be a global, not even convenience.	2025-06-25 08:35:19 +03:00
Andrei Chekun	d81e0d0754	test.py: pytest c++ facades should respect saving logs on success BoostFacade and UnitFacade saved the logs only when a test failed, ignoring the -s parameter that should allow saving logs on success. This PR adds a check for this parameter. Closes scylladb/scylladb#24596	2025-06-24 20:53:32 +03:00
Botond Dénes	3e1c50e9a7	db: introduce corrupt_data_handler Similar to large_data_handler, this interface allows sstable writers to delegate the handling of corrupt data. Two implementations are provided: * system_table_corrupt_data_handler - saves corrupt data in system.corrupt_data, with a TTL=10days (non-configurable for now) * nop_corrupt_data_handler - drops corrupt data	2025-06-24 14:57:00 +03:00
Botond Dénes	b931145a26	mutation: introduce frozen_mutation_fragment_v2 Mirrors frozen_mutation_fragment and shares most of the underlying serialization code, the only exception is replacing range_tombstone with range_tombstone_change in the mutation fragment variant.	2025-06-24 11:05:31 +03:00
Botond Dénes	64f8500367	mutation/mutation_partition_view: read_{clustering,static}_row(): return row type Instead of mutation_fragment, let caller convert into mutation_fragment. Allows reuse in future callers which will want to convert to mutation_fragment_v2.	2025-06-24 11:05:31 +03:00
Botond Dénes	678deece88	mutation/mutation_partition_view: extract de-ser of {clustering,static} row From the visitor in frozen_mutation_fragment::unfreeze(). We will want to re-use it in the future frozen_mutation_fragment_v2::unfreeze(). Code-movement only, the code is not changed.	2025-06-24 11:05:31 +03:00
Botond Dénes	093d4f8d69	idl-compiler.py: generate skip() definition for enums serializers Currently they only have the declaration and so far they got away with it, looks like no users exists, but this is about to change so generate the definition too.	2025-06-24 11:05:31 +03:00
Botond Dénes	b0d5462440	idl: extract full_position.idl from position_in_partition.idl A future user of position_in_partition.idl doesn't need full_position and so doesn't want to include full_position.hh to fix compile errors when including position_in_partition.idl.hh. Extract it to a separate idl file: it has a single user in a storage_proxy VERB.	2025-06-24 11:05:30 +03:00
Botond Dénes	0753643606	db/system_keyspace: add apply_mutation() Allow applying writes in the form of mutations directly to the keyspace. Allows lower-level mutation API to build writes. Advantageous if writes can contain large cells that would otherwise possibly cause large allocation warnings if used via the internal CQL API.	2025-06-24 11:05:30 +03:00
Botond Dénes	92b5fe8983	db/system_keyspace: introduce the corrupt_data table To serve as a place to store corrupt mutation fragments. These fragments cannot be written to sstables, as they would be spread around by compaction and/or repair. They even might make parsing the sstable impossible. So they are stored in this special table instead, kept around to be inspected later and possibly restored if possible.	2025-06-24 11:05:30 +03:00
Abhinav Jha	5ff693eff6	group0: modify `start_operation` logic to account for synchronize phase race condition In the present scenario, the bootstrapping node undergoes the synchronize phase after initialization of group0, then enters the post_raft phase and becomes fully ready for group0 operations. The topology coordinator is agnostic of this and issues the stream ranges command as soon as the node successfully completes `join_group0`. Although the time a node booting into an already upgraded cluster remains in the synchronize phase is negligible, this race condition causes trouble in a small percentage of cases, since the stream ranges operation fails and the node fails to bootstrap. This commit addresses this issue and updates the error throw logic to account for this edge case, letting the node wait (with timeouts) for the synchronize phase to finish instead of throwing an error. A regression test is also added to confirm that this code change works. The test adds a wait in the synchronize phase for the newly joining node and releases it only after the program counter reaches the synchronize case in the `start_operation` function. Hence it indicates that in the updated code, start_operation will wait for the node to finish the synchronize phase instead of throwing an error. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#23536 Closes scylladb/scylladb#23829	2025-06-24 10:04:39 +02:00
Botond Dénes	bce89c0f5e	sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path So parse errors on corrupt SSTables don't result in crashes, instead just aborting the read in process. There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This patch tried to focus on those usages which are in the read path. Some places not only used on the read path may have been converted too, where the usage of said method is not clear.	2025-06-24 09:16:28 +03:00
Botond Dénes	27e26ed93f	sstables/exceptions: introduce parse_assert() To replace SCYLLA_ASSERT on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database, the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced parse_assert() uses on_internal_error() under the hood, which prints a backtrace and optionally allows for aborting on the error, to generate a coredump.	2025-06-24 09:15:29 +03:00
Jenkins Promoter	b0a7fcf21b	Update pgo profiles - aarch64	2025-06-23 19:20:50 +03:00
Jenkins Promoter	e15e5a6081	Update pgo profiles - x86_64	2025-06-23 19:20:50 +03:00
Marcin Maliszkiewicz	68ead01397	test: add test for live updates of generic server config Affected config: uninitialized_connections_semaphore_cpu_concurrency	2025-06-23 17:56:26 +02:00
Marcin Maliszkiewicz	45392ac29e	utils: don't allow to discard updateable_value observer If the object returned from observe() is destructed, it stops observing, potentially causing subtle bugs. Typically, the observer object is retained as a class member.	2025-06-23 17:54:01 +02:00
Marcin Maliszkiewicz	c6a25b9140	generic_server: fix connections semaphore config observer When the temporary value returned by observe() is destructed, it disconnects from the updateable_value, so the code immediately stops observing. To fix it we need to retain the observer in the class object.	2025-06-23 17:54:01 +02:00
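The observer lifetime issue described in the two commits above can be illustrated with a minimal Python sketch. The classes `UpdateableValue`, `Observer`, and `Server` are hypothetical stand-ins, not ScyllaDB's C++ types, and an explicit `close()` call models the C++ destructor because Python has no deterministic destruction.

```python
class UpdateableValue:
    """Hypothetical stand-in for a live-updatable config value."""

    def __init__(self, value):
        self._value = value
        self._callbacks = {}
        self._next_id = 0

    def observe(self, callback):
        # Register the callback and hand back an object that owns the registration.
        cb_id = self._next_id
        self._next_id += 1
        self._callbacks[cb_id] = callback
        return Observer(self, cb_id)

    def set(self, value):
        self._value = value
        for callback in list(self._callbacks.values()):
            callback(value)


class Observer:
    """Owns one registration; close() models the C++ destructor."""

    def __init__(self, source, cb_id):
        self._source = source
        self._cb_id = cb_id

    def close(self):
        # Disconnect from the value: no further updates are delivered.
        self._source._callbacks.pop(self._cb_id, None)


class Server:
    """Retains the observer as a member, so it lives as long as the server."""

    def __init__(self, concurrency):
        self.limit = None
        self._observer = concurrency.observe(self._on_update)

    def _on_update(self, value):
        self.limit = value
```

Calling `value.observe(callback).close()` right away mimics the discarded temporary: later `set()` calls never reach the callback, while a `Server` that keeps `_observer` continues to receive updates.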
Patryk Jędrzejczak	6489308ebc	Merge 'Introduce a queue of global topology requests.' from Gleb Natapov Currently only one global topology request (such as truncate, cdc repair, cleanup and alter table) can be pending. If one is already pending others will be rejected with an error. This is not very user friendly, so this series introduces a queue of global requests which allows queuing many global topology requests simultaneously. Fixes: #16822 No need to backport since this is a new feature. Closes scylladb/scylladb#24293 * https://github.com/scylladb/scylladb: topology coordinator: simplify truncate handling in case request queue feature is disable topology coordinator: fix indentation after the previous patch topology coordinator: allow running multiple global commands in parallel topology coordinator: Implement global topology request queue topology coordinator: Do not cancel global requests in cancel_all_requests topology coordinator: store request type for each global command topology request: make it possible to hold global request types in request_type field topology coordinator: move alter table global request parameters into topology_request table topology coordinator: move cleanup global command to report completion through topology_request table topology coordinator: no need to create updates vector explicitly topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it topology coordinator: handle error during new_cdc_generation command processing topology coordinator: remove unneeded semicolon topology coordinator: fix indentation after the last commit topology coordinator: move new_cdc_generation topology request to use topology_request table for completion gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag	2025-06-23 16:08:09 +03:00
Aleksandra Martyniuk	9c3fd2a9df	nodetool: repair: repair only vnode keyspaces nodetool repair command repairs only vnode keyspaces. If a user tries to repair a tablet keyspace, an exception is thrown. Closes scylladb/scylladb#23660	2025-06-23 16:08:09 +03:00
Avi Kivity	52f11e140f	tools: optimized_clang: make it work in the presence of a scylladb profile optimized_clang.sh trains the compiler using profile-guided optimization (pgo). However, while doing that, it builds scylladb using its own profile stored in pgo/profiles and decompressed into build/profile.profdata. Due to the funky directory structure used for training the compiler, that path is invalid during the training and the build fails. The workaround was to build on a cloud machine instead of a workstation - this worked because the cloud machine didn't have git-lfs installed, and therefore did not see the stored profile, and the whole mess was averted. To make this work on a machine that does have access to stored profiles, disable use of the stored profile even if it exists. Fixes #22713 Closes scylladb/scylladb#24571	2025-06-23 16:08:09 +03:00
Botond Dénes	ab96c703ff	mutation: check key of inserted rows Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to work with the changes: it has to make the schema use compact storage, otherwise the non-full keys used by this test are rejected by the new checks. Fixes: https://github.com/scylladb/scylladb/issues/24506	2025-06-23 09:38:45 +03:00
Botond Dénes	8b756ea837	compound: optimize is_full() for single-component types For such compounds, unserializing the key is not necessary to determine whether the key is full or not.	2025-06-23 09:38:45 +03:00
Nadav Har'El	85c19d21bb	Merge 'cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Karol Nowacki cql, schema: Extend name length limit from 48 to 192 bytes This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes. The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389) and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint. This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases. The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data. When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID. For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name. The directory name for this log table becomes the longest possible representation. Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas. To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows: 255 bytes (common filesystem limit for a path component) - 32 bytes (for the 32-character UUID string) - 1 byte (for the '-' separator) - 15 bytes (for the '_scylla_cdc_log' suffix) - 15 bytes (reserved for future use) ---------- = 192 bytes (Maximum allowed name length) This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038). 
This patch also updates/adds all associated tests to validate the new 192-byte limit. The documentation has been updated accordingly. Fixes #4480 Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release. Closes scylladb/scylladb#24500 * github.com:scylladb/scylladb: cql, schema: Extend name length limit from 48 to 192 bytes replica: Remove unused keyspace::init_storage()	2025-06-22 17:41:10 +03:00
Avi Kivity	770b91447b	Merge 'memtable: ensure _flushed_memory doesn't grow above total_memory' from Michał Chojnowski `dirty_memory_manager` tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus an upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. `dirty_memory_manager` controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should be not greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in `~flush_memory_accounter` which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emits rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the `mutation_cleaner` later during the flush and decrease `total_memory` below `_flushed_memory`. 
There is a piece of code in `mutation_cleaner` intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`. Fixes scylladb/scylladb#21413 Backport candidate because it fixes a crash that can happen in existing stable branches. Closes scylladb/scylladb#21638 * github.com:scylladb/scylladb: memtable: ensure _flushed_memory doesn't grow above total memory usage replica/memtable: move region_listener handlers from dirty_memory_manager to memtable	2025-06-22 11:19:25 +03:00
Michał Chojnowski	975e7e405a	memtable: ensure _flushed_memory doesn't grow above total memory usage dirty_memory_manager tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus an upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. dirty_memory_manager controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should be not greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in ~flush_memory_accounter which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emits rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the mutation_cleaner later during the flush and decrease `total_memory` below `_flushed_memory`. 
There is a piece of code in mutation_cleaner intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.	2025-06-20 11:42:30 +02:00
Michał Chojnowski	7d551f99be	replica/memtable: move region_listener handlers from dirty_memory_manager to memtable The memtable wants to listen for changes in its `total_memory` in order to decrease its `_flushed_memory` in case some of the freed memory has already been accounted as flushed. (This can happen because the flush reader sees and accounts even outdated MVCC versions, which can be deleted and freed during the flush). Today, the memtable doesn't listen to those changes directly. Instead, some calls which can affect `total_memory` (in particular, the mutation cleaner) manually check the value of `total_memory` before and after they run, and they pass the difference to the memtable. But that's not good enough, because `total_memory` can also change outside of those manually-checked calls -- for example, during LSA compaction, which can occur anytime. This makes memtable's accounting inaccurate and can lead to unexpected states. But we already have an interface for listening to `total_memory` changes actively, and `dirty_memory_manager`, which also needs to know it, does just that. So what happens e.g. when `mutation_cleaner` runs is that `mutation_cleaner` checks the value of `total_memory` before it runs, then it runs, causing several changes to `total_memory` which are picked up by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of `total_memory` and passes the difference to `memtable`, which corrects whatever was observed by `dirty_memory_manager`. To allow memtable to modify its `_flushed_memory` correctly, we need to make `memtable` itself a `region_listener`. Also, instead of the situation where `dirty_memory_manager` receives `total_memory` change notifications from `logalloc` directly, and `memtable` fixes the manager's state later, we want only the memtable to listen for the notifications and pass them, already adjusted accordingly, to the manager, so that there are no intermediate wrong states. 
This patch moves the `region_listener` callbacks from the `dirty_memory_manager` to the `memtable`. It's not intended to be a functional change, just a source code refactoring. The next patch will be a functional change enabled by this.	2025-06-20 11:42:30 +02:00
Łukasz Paszkowski	a9a53d9178	compaction_manager: cancel submission timer on drain The `drain` method cancels all running compactions and moves the compaction manager into the disabled state. To move it back to the enabled state, the `enable` method shall be called. This, however, throws an assertion error as the submission timer is not cancelled and re-enabling the manager tries to arm the already armed timer. Thus, cancel the timer when calling the drain method to disable the compaction manager. Fixes https://github.com/scylladb/scylladb/issues/24504 All versions are affected. So it's a good candidate for a backport. Closes scylladb/scylladb#24505	2025-06-20 11:33:49 +03:00
Nadav Har'El	70f5a6a4d6	test/cqlpy: fix run-cassandra script to ignore CASSANDRA_HOME As test/cqlpy/README.md explains, the way to tell the run-cassandra script which version of Cassandra should be run is through the "CASSANDRA" variable, for example: CASSANDRA=$HOME/apache-cassandra-4.1.6/bin/cassandra \ test/cqlpy/run-cassandra test_file.py::test_function But all the Cassandra scripts, of all versions, have one strange feature: If you set CASSANDRA_HOME, then instead of running the actual Cassandra script you tried to run (in this case, 4.1.6), the Cassandra script goes to run the other Cassandra from CASSANDRA_HOME! This means that if a user happens to have, for some reason, set CASSANDRA_HOME, then the documented "CASSANDRA" variable doesn't work. The simple fix is to clear CASSANDRA_HOME in the environment that run-cassandra passes to Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24546	2025-06-20 11:31:02 +03:00
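The fix described above, clearing CASSANDRA_HOME in the environment handed to the child process, might look roughly like the following. This is a minimal sketch, not the actual run-cassandra code; the helper name `cassandra_env` is hypothetical.

```python
import os


def cassandra_env(base_env=None):
    """Return a copy of the environment without CASSANDRA_HOME.

    Hypothetical helper: the copy leaves the caller's own environment untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Drop the variable if present; do nothing if it was never set.
    env.pop("CASSANDRA_HOME", None)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen`, so the Cassandra script named by the "CASSANDRA" variable is the one that actually runs.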
Anna Stuchlik	17eabbe712	doc: improve the tablets limitations section This PR improves the Limitations and Unsupported Features section for tablets, as it has been confusing to the customers. Refs https://github.com/scylladb/scylla-enterprise/issues/5465 Fixes https://github.com/scylladb/scylladb/issues/24562 Closes scylladb/scylladb#24563	2025-06-20 11:28:38 +03:00
Gleb Natapov	e364995e28	api: return error from get_host_id_map if gossiper is not enabled yet. Token metadata api is initialized before gossiper is started. get_host_id_map REST endpoint cannot function without the fully initialized gossiper though. The gossiper is started deep in the join_cluster call chain, but if we move token_metadata api initialization after the call it means that no api will be available during bootstrap. This is not what we want. Make a simple fix by returning an error from the api if the gossiper is not initialized yet. Fixes: #24479 Closes scylladb/scylladb#24575	2025-06-20 11:27:28 +03:00
Andrei Chekun	392a7fc171	test.py: Fix the boost output file name The file name for the boost test does not use run_id, so each subsequent run overwrites the logs from the previous one. If the first repeat fails and the second passes, the failed log is overwritten. This PR allows saving the failed one. Closes scylladb/scylladb#24580	2025-06-20 11:26:16 +03:00
Asias He	c5a136c3b5	storage_service: Use utils::chunked_vector to avoid big allocation The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> 
>::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 Closes scylladb/scylladb#24561	2025-06-19 16:51:01 +03:00
Andrei Chekun	fcc2ad8ff5	test.py: Fix test result are overwritten Currently, CI uses several nodes to execute the different modes to reduce overall execution time. When the results are copied from the nodes to the main job, test reports are overwritten, since they use the same directory and the same name. This patch makes these results distinguishable so that they are not overwritten. Closes scylladb/scylladb#24559	2025-06-19 16:51:01 +03:00
Pavel Emelyanov	dc166be663	s3: Mark claimed_buffer constructor noexcept It just std::move-s a buffer and a semaphore_units object; both moves are noexcept, so the constructor itself is too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24552	2025-06-18 20:36:45 +03:00
Avi Kivity	c89ab90554	Merge 'main: don't start maintenance auth service if not enabled' from Marcin Maliszkiewicz In `f96d30c2b5` we introduced the maintenance service, which is an additional instance of auth::service. But this service has a somewhat confusing 2-level startup mechanism: it's initialized with sharded<Service>::start and then auth::service::start (different method with the same name to confuse even more). When maintenance_socket was disabled (default setting), the code did only the first part of the startup. This registered a config observer but didn't create a permission_cache instance. As a result, a crash on SIGHUP when config is reloaded can occur. Fixes: https://github.com/scylladb/scylladb/issues/24528 Backport: all not eol versions since 6.0 and 2025.1 Closes scylladb/scylladb#24527 * github.com:scylladb/scylladb: test: add test for live updates of permissions cache config main: don't start maintenance auth service if not enabled	2025-06-18 20:28:53 +03:00
Karol Nowacki	4577c66a04	cql, schema: Extend name length limit from 48 to 192 bytes This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes. The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389) and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint. This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases. The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data. When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID. For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name. The directory name for this log table becomes the longest possible representation. Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas. To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows: 255 bytes (common filesystem limit for a path component) - 32 bytes (for the 32-character UUID string) - 1 byte (for the '-' separator) - 15 bytes (for the '_scylla_cdc_log' suffix) - 15 bytes (reserved for future use) ---------- = 192 bytes (Maximum allowed name length) This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038). This patch also updates/adds all associated tests to validate the new 192-byte limit. The documentation has been updated accordingly.	2025-06-18 14:08:38 +02:00
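The arithmetic in the commit message above can be checked with a short sketch. The constant and function names are hypothetical; only the numbers and the `_scylla_cdc_log` suffix come from the message.

```python
FILESYSTEM_NAME_MAX = 255           # common filesystem limit for a path component
UUID_LEN = 32                       # 32-character UUID string
SEPARATOR_LEN = len("-")            # dash between table name and UUID
CDC_LOG_SUFFIX = "_scylla_cdc_log"  # suffix of the CDC log table name
RESERVED = 15                       # bytes reserved for future use

MAX_NAME_LEN = (
    FILESYSTEM_NAME_MAX - UUID_LEN - SEPARATOR_LEN - len(CDC_LOG_SUFFIX) - RESERVED
)


def longest_directory_name_len(table_name):
    """Byte length of the CDC log table's directory name for a given table name."""
    return len(table_name.encode("utf-8")) + len(CDC_LOG_SUFFIX) + SEPARATOR_LEN + UUID_LEN
```

With a 192-byte table name, the longest directory name comes to 240 bytes, which leaves the 15 reserved bytes under the 255-byte limit.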
Karol Nowacki	a41c12cd85	replica: Remove unused keyspace::init_storage() This function was declared but had no implementation or callers. It is being removed as minor code cleanup.	2025-06-18 14:08:38 +02:00
Petr Gusev	45f5efb9ba	paxos_state: read repair for intranode_migration A replica is not marked as 'pending' during intranode_migration. The sp::get_paxos_participants returns the same set of endpoints as before or after migration. No 'double quorum' means the replica should behave as a single paxos acceptor. This is done by checking on read that the state on both shards is the same, and repairing it before continuing if it is not.	2025-06-18 12:11:32 +02:00
Petr Gusev	583fb0e402	paxos_state: fix get_replica_lock for intranode_migration Suppose a replica gets two requests at roughly the same time for the same key. The requests are coming from two different LWT coordinators, one is holding tablet_transition_stage::streaming erm, another - tablet_transition_stage::write_both_read_new erm. The read shard is different for these requests, so they don't wait for each other in get_replica_lock. The first request reads the state, the second request does the whole RMW for paxos state and responds to its coordinator, then the first request blindly overwrites the state -- the effects of the second request are lost. In this commit we fix this problem by taking the lock on both shards, starting from the smaller shard ID to the larger one, to avoid deadlocks.	2025-06-18 12:11:32 +02:00
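The lock-ordering idea in the commit above, always taking the lock for the smaller shard ID first, can be sketched with Python threads. This is only an illustration under assumed names (`shard_locks`, `lock_shards`); the real code runs on Seastar shards rather than threads.

```python
import threading
from contextlib import ExitStack, contextmanager

# Hypothetical per-shard locks, keyed by shard ID.
shard_locks = {shard: threading.Lock() for shard in range(2)}


@contextmanager
def lock_shards(locks, shard_a, shard_b):
    """Hold the locks of both shards, acquired in ascending shard-ID order."""
    with ExitStack() as stack:
        # The set removes a duplicate when both IDs are equal; sorting gives
        # every caller the same acquisition order, which rules out deadlock.
        for shard in sorted({shard_a, shard_b}):
            stack.enter_context(locks[shard])
        yield
```

Two callers that name the shards in opposite order, `lock_shards(shard_locks, 0, 1)` and `lock_shards(shard_locks, 1, 0)`, both acquire shard 0 first, so neither can hold one lock while waiting for the other.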
Petr Gusev	aa970bf2e4	sp::cas_shard: rename to get_cas_shard We intend to introduce a separate cas_shard class in the next commits. We rename the existing function here to avoid conflicts.	2025-06-18 11:51:48 +02:00
Petr Gusev	85eac7c34e	token_metadata_guard: a topology guard for a token Data-plane requests typically hold a strong pointer to the effective_replication_map (ERM) to protect against tablet migrations and other topology operations. This works because major steps in the topology coordinator use global barriers. These barriers install a new token_metadata version on each shard and wait for all references to the old one to be dropped. Since the ERM holds a strong pointer to token_metadata, it effectively blocks these operations until it's released. For LWT, we usually deal with a single token within a single tablet. In such cases, it's enough to block topology changes for just that one tablet. The existing tablet_metadata_guard class already supports this: it tracks tablet-specific changes and updates the ERM pointer automatically, unless the change affects the guarded tablet. However, this only works for tablet-aware tables. To support LWT with vnodes (i.e., non-tablet-aware tables), this commit introduces a new token_metadata_guard class. It wraps tablet_metadata_guard when the table uses tablets, and falls back to holding a plain strong ERM pointer otherwise. In the next commits, we’ll migrate LWT to use token_metadata_guard in paxos_response_handler instead of erm.	2025-06-18 11:51:48 +02:00
Petr Gusev	73221aa7b1	tablet_metadata_guard: mark as noncopyable and nonmoveable tablet_metadata_guard passes a raw pointer to get_validity_abort_source, so it can't be easily copied or moved. In this commit we make this explicit. We define the destructor in the cpp file, because the autogenerated one complains about lw_shared_ptr<replica::table>, as replica::table is only forward-declared in the headers.	2025-06-18 11:50:46 +02:00
Marcin Maliszkiewicz	dd01852341	test: add test for live updates of permissions cache config	2025-06-18 11:27:08 +02:00
Marcin Maliszkiewicz	97c60b8153	main: don't start maintenance auth service if not enabled In `f96d30c2b5` we introduced the maintenance service, which is an additional instance of auth::service. But this service has a somewhat confusing 2-level startup mechanism: it's initialized with sharded<Service>::start and then auth::service::start (different method with the same name to confuse even more). When maintenance_socket was disabled (default setting), the code did only the first part of the startup. This registered a config observer but didn't create a permission_cache instance. As a result, a crash on SIGHUP when config is reloaded can occur.	2025-06-18 11:27:08 +02:00
Botond Dénes	da1a3dd640	Merge 'test: introduce upgrade tests to test.py, add a SSTable dict compression upgrade test' from Michał Chojnowski This PR adds an upgrade test for SSTable compression with shared dictionaries, and adds some bits to pylib and test.py to support that. In the series, we: 1. Mount `$XDG_CACHE_HOME` into dbuild. 2. Add a pylib function which downloads and installs a released ScyllaDB package into a subdirectory of `$XDG_CACHE_HOME/scylladb/test.py`, and returns the path to `bin/scylla`. 3. Add new methods and params to the cluster manager, which let the test start nodes with historical Scylla executables, and switch executables during the test. 4. Add a test which uses the above to run an upgrade test between the released package and the current build. 5. Add `--run-internet-dependent-tests` to `test.py` which lets the user of `test.py` skip this test (and potentially other internet-dependent tests in the future). (The patch modifying `wait_for_cql_and_get_hosts` is a part of the new test: the new test needs it to test how particular nodes in a mixed-version cluster react to some CQL queries.) This is a follow-up to #23025, split into a separate PR because the potential addition of upgrade tests to `test.py` deserved a separate thread. Needs backport to 2025.2, because that's where the tested feature is introduced. Fixes #24110 Closes scylladb/scylladb#23538 * github.com:scylladb/scylladb: test: add test_sstable_compression_dictionaries_upgrade.py test.py: add --run-internet-dependent-tests pylib/manager_client: add server_switch_executable test/pylib: in add_server, give a way to specify the executable and version-specific config pylib: pass scylla_env environment variables to the topology suite test/pylib: add get_scylla_2025_1_executable() pylib/scylla_cluster: give a way to pass executable-specific options to nodes dbuild: mount "$XDG_CACHE_HOME/scylladb"	2025-06-18 12:21:21 +03:00
Benny Halevy	7c867b308f	feature_service: never disable UUID_SSTABLE_IDENTIFIERS The config option is unused since `6da758d74c` Refs #10459 Refs #20337 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	ecc7272a07	test: sstable_move_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	49ca442e7c	test: sstable_directory_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	15bee9f232	sstables: sstable_generation_generator: set last_generation=0 by default Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	079c5fe5e3	test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	f0f7c83705	test: lib: test_env: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	0310a03de6	test: sstable_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	b00b805da6	test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config Which is `true` by default anyhow. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	f644c5896f	test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	bfa0bb78f9	test: sstable_compaction_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Avi Kivity	2177ec8dc1	gdb: adjust unordered container accessors for libstdc++15 In libstdc++15, the internal structure of an unordered container hashtable node changed from _M_storage._M_storage.__data to just _M_storage._M_storage (though the layout is the same). Adjust the code to work with both variants. Closes scylladb/scylladb#24549	2025-06-18 09:15:03 +03:00
Michał Chojnowski	27f66fb110	test/boost/mutation_reader_test: fix a use-after-free in `test_fast_forwarding_combined_reader_is_consistent_with_slicing` The contract in mutation_reader.hh says: ``` // pr needs to be valid until the reader is destroyed or fast_forward_to() // is called again. future<> fast_forward_to(const dht::partition_range& pr) { ``` `test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates this by passing a temporary to `fast_forward_to`. Fix that. Fixes scylladb/scylladb#24542 Closes scylladb/scylladb#24543	2025-06-17 19:30:50 +03:00
Anna Stuchlik	648d8caf27	doc: add support for z3 GCP This commit adds support for z3-highmem-highlssd instance types to Cloud Instance Recommendations for GCP. Fixes https://github.com/scylladb/scylladb/issues/24511 Closes scylladb/scylladb#24533	2025-06-17 13:50:46 +03:00
Robert Bindar	1dd37ba47a	Add dev documentation for manipulating s3 data manually This patch intends to give an overview of where, when and how we store data in S3 and provide a quick set of commands which help gain local access to the data in case there is a need for manual intervention. The patch also collects in the same place links/descriptions for all formats we use in S3. Fixes #22438 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#24323	2025-06-17 13:21:30 +03:00
Pavel Emelyanov	b0766d1e73	Merge 's3_client: Refactor `range` class for state validation' from Ernest Zaslavsky Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333 No backport needed since this problem related only to this change https://github.com/scylladb/scylladb/pull/23880 Closes scylladb/scylladb#24312 * github.com:scylladb/scylladb: s3_client: headers cleanup s3_client: Refactor `range` class for state validation	2025-06-17 10:34:55 +03:00
Ernest Zaslavsky	e398576795	s3_client: Fix hang in get() on EOF by signaling condition variable * Ensure _get_cv.signal() is called when an empty buffer received * Prevents `get()` from stalling indefinitely while waiting on EOF * Found when testing https://github.com/scylladb/scylladb/pull/23695 Closes scylladb/scylladb#24490	2025-06-17 10:33:19 +03:00
Calle Wilund	4a98c258f6	http: Add missing thread_local specifier for static Refs #24447 Patch adding this somehow managed to leave out the thread_local specifier. While gnutls cert object can be shared across shards just fine, the actual shared_ptr here cannot, thus we could cause memory errors. Closes scylladb/scylladb#24514	2025-06-17 10:23:52 +03:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Ernest Zaslavsky	1b20e0be4a	s3_client: headers cleanup	2025-06-16 16:02:30 +03:00
Ernest Zaslavsky	9ad7a456fe	s3_client: Refactor `range` class for state validation Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.	2025-06-16 16:02:24 +03:00
Pavel Emelyanov	5c2e5890a6	Merge 'test.py: Integrate pytest c++ test execution to test.py' from Andrei Chekun With current changes, pytest executes boost tests. Gathering metrics added to the pytest BoostFacade and UnitFacade to have the possibility to get them for C++ test as previously. Since boost, raft, unit, and ldap directories aren't executed by test.py, suite.yaml files are renamed to test_config.yaml to preserve the old way of test configuration and removing them from execution by test.py Pytest executes all modes by itself, JUnit report for the C++ test will be one for the run. That means that there is no possibility to output them in testlog in different folders. So testlog/report directory is used to store all kinds of reports generated during tests. JUnit reports should be testlog/report/junit, Allure reports should be in testlog/report/allure. Breaking changes: 1. Terminal output changed. test.py will run pytest for the next directories: `test/boost`, `test/ldap`, `test/raft`, `test/unit`. `test.py` will blindly translate the output of the pytest to the terminal. Then when all these tests are finished, `test.py` will continue to show previous output for the rest of the test. 2. The format of execution of C++ test directories mentioned above has been changed. Now it will be a simple path to the file with extension. For example, instead of `boost/aggregate_fcts_test` now you need to use `test/boost/aggregate_fcts_test.cc` 3. This PR creates a spike in test amount. The previous logic was to consolidate the boost results from different runs and different modes to one report. So for the three repeats and three modes (nine test results) in CI was shown one result. Now it shows nine results, with differentiating them by mode and run. Note: Pytest uses pytest-xdist module to run tests in parallel. The Frozen toolchain has this dependency installed, for the local use, please install it manually. Changes for CI https://github.com/scylladb/scylla-pkg/pull/4949. 
It will be merged once the current PR is in master. A short disruption is expected until the PR in scylla-pkg is merged. Fixes: https://github.com/scylladb/qa-tasks/issues/1777 Closes scylladb/scylladb#22894 * github.com:scylladb/scylladb: test.py: clean code that isn't used anymore test.py: switch off C++ tests from test.py discovery test.py: Integrate pytest c++ test execution to test.py	2025-06-16 16:01:37 +03:00
Pavel Emelyanov	0b6532a895	api: Shorten get_simple_states() handler The one collects map<ip, state> then converts it to a jsonable vector of helper objects with key and value members. This patch removes the intermediate map and creates the vector instantly. With that change the handler makes less data manipulations and behaves like the get_all_endpoint_states one. Very similar change was done in `12420dc644` with get_host_to_id_map handler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24456	2025-06-16 15:21:27 +03:00
Tomasz Grabiec	cdb1499898	Merge 'interval: reduce memory footprint' from Avi Kivity The interval class's memory footprint isn't important for single objects, but intervals are frequently held in moderately sized collections. In #3335 this caused a stall. Therefore reducing interval's memory footprint and reduce allocation pressure. This series does this by consolidating badly-padded booleans in the object tree spanned by interval into 5 booleans that are consecutive in memory. This reduces the space required by these booleans from 40 bytes to 8 bytes. perf-simple-query report (with refresh-pgo-profiles.sh for each measurement): before: 252127.60 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37128 insns/op, 18147 cycles/op, 0 errors) INFO 2025-06-07 21:00:34,010 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to 1749319231 (min id=4dbed2f4-43c9-11f0-cbc6-87d1a08b4ca4) 246492.37 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37153 insns/op, 18411 cycles/op, 0 errors) 253633.11 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37127 insns/op, 17941 cycles/op, 0 errors) 254029.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37155 insns/op, 17951 cycles/op, 0 errors) 254465.76 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37123 insns/op, 17906 cycles/op, 0 errors) throughput: mean= 252149.75 standard-deviation=3282.75 median= 253633.11 median-absolute-deviation=1880.17 maximum=254465.76 minimum=246492.37 instructions_per_op: mean= 37137.24 standard-deviation=15.71 median= 37127.54 median-absolute-deviation=14.45 maximum=37155.24 minimum=37122.79 cpu_cycles_per_op: mean= 18071.19 standard-deviation=212.25 median= 17950.62 median-absolute-deviation=130.10 maximum=18411.50 minimum=17906.13 after: 252561.26 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37039 insns/op, 18075 cycles/op, 0 errors) 256876.44 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37022 insns/op, 17785 cycles/op, 0 errors) 
257084.38 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37030 insns/op, 17840 cycles/op, 0 errors) 257305.35 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37042 insns/op, 17804 cycles/op, 0 errors) 258088.53 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37028 insns/op, 17778 cycles/op, 0 errors) throughput: mean= 256383.19 standard-deviation=2185.22 median= 257084.38 median-absolute-deviation=922.16 maximum=258088.53 minimum=252561.26 instructions_per_op: mean= 37032.17 standard-deviation=8.06 median= 37030.46 median-absolute-deviation=6.44 maximum=37041.83 minimum=37021.93 cpu_cycles_per_op: mean= 17856.60 standard-deviation=124.70 median= 17804.16 median-absolute-deviation=71.24 maximum=18075.50 minimum=17777.95 A small improvement is observed in instructions_per_op. It could be random fluctuations in the compiler performance, or maybe the default constructor/destructor of interval are meaningful even in this simple test. Small performance improvement, so not a backport candidate. Closes scylladb/scylladb#24232 * github.com:scylladb/scylladb: interval: reduce sizeof interval: change start()/end() not to return references to data members interval: rename start_ref() back to start() (and end_ref() etc). interval: rename start() to start_ref() (and end() etc). test: wrapping_interval_test: add more tests for intervals	2025-06-16 09:23:56 +02:00
Botond Dénes	898ce98500	db/batchlog_manager: remove unused member _total_batches_replayed And its getter. There are no users for either. Closes scylladb/scylladb#24416	2025-06-16 09:37:00 +03:00
Nadav Har'El	847d9c0911	alternator: update documentation that ttl with tablets does work Our documentation docs/alternator/new-apis.md claims that Alternator TTL does not work with tablets, due to issue #16567. However, we fixed that issue in commit `de96c28625`. So let's drop the outdated statement that it doesn't work. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24427	2025-06-16 09:36:11 +03:00
Ernest Zaslavsky	2b300c8eb9	s3_client: Improve reporting of S3 client statistics Revise how we report statistics for `chunked_download_source`. Ensure metrics for downloaded but unconsumed data are visible, as they do not contribute to read amplification, which is tracked separately. Closes scylladb/scylladb#24491	2025-06-16 09:33:57 +03:00
Pavel Emelyanov	9aaa33c15a	Merge 'main.cc: fix group0 shutdown order' from Petr Gusev Applier fiber needs local storage, so before shutting down local storage we need to make sure that group0 is stopped. We also improve the logs for the case when `gate_closed_exception` is thrown while a mutation is being written. Fixes [scylladb/scylladb#24401](https://github.com/scylladb/scylladb/issues/24401) Backport: no backport -- not safe and the problem is minor. Closes scylladb/scylladb#24418 * github.com:scylladb/scylladb: storage_service: test_group0_apply_while_node_is_being_shutdown main.cc: fix group0 shutdown order storage_proxy: log gate_closed_exception	2025-06-16 09:32:34 +03:00
Amnon Heiman	55b21b01ee	alternator/stats.cc, metrics-config.yml: docs fix per-table metrics This patch updates alternator/stats.cc and the get_description.py configuration (metrics-config.yml) to restore compatibility with per-table alternator metrics in the documentation generation process. Previously, the group name for metrics was selected using an inline expression like (has_table)? "alternator_table" : "alternator", which made it difficult to maintain a straightforward mapping in the configuration file. With this change, the group name is now assigned to a variable in alternator/stats.cc, allowing metrics-config.yml to map group names directly. This makes the configuration easier to maintain and enables get_description.py to document both global and per-table metrics correctly. This is a minimal, targeted fix to get the documentation working again with the new per-table metrics format. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#24509	2025-06-15 18:06:36 +03:00
Jenkins Promoter	1b5eee6a12	Update pgo profiles - aarch64	2025-06-15 04:57:59 +03:00
Jenkins Promoter	e0c2d591c7	Update pgo profiles - x86_64	2025-06-15 04:44:13 +03:00
Avi Kivity	42d7ae1082	interval: reduce sizeof An interval object stores five booleans: start()->is_inclusive(), a boolean since start() itself is an std::optional, two more for end(), and is_singular(). Due to bad packing, these five booleans occupy 8 bytes each, for a total of 40 bytes. Re-pack the interval class by storing those booleans explicitly close by. Since we lose std::optional's ability to store a maybe-constructed object, we re-implement it using anonymous unions and therefore have to implement the 5 special methods. This helps save space when vectors of intervals are used, as seen in #3335 for example.	2025-06-14 21:29:43 +03:00
Avi Kivity	f3dccc2215	interval: change start()/end() not to return references to data members We'd like to change the data layout of `interval` to save space. As a result, start() and end() which return references to data members must return objects (not references). Since we'd like to maintain zero-copy for these functions, we change them to return objects containing references (rather than references to objects), avoiding copying of potentially expensive objects. We repurpose the interval_bound class to hold references (by instantiating it with `const T&` instead of `T`) and provide converting constructors. To make transform_bounds() retain zero-copy, we add start() and end() that take *this by rvalue reference.	2025-06-14 21:26:17 +03:00
Avi Kivity	16fb68bb5e	interval: rename start_ref() back to start() (and end_ref() etc). To reduce noise, rename start_ref() back to its original name start(), after it was changed in the previous patch to force an audit of all calls.	2025-06-14 21:26:16 +03:00
Avi Kivity	3363bc41e2	interval: rename start() to start_ref() (and end() etc). We are about to change start() to return a proxy object rather than a `const interval_bound<T>&`. This is generally transparent, except in one case: `auto x = i.start()`. With the current implementation, we'll copy the object referred to and assign it to x. With the planned implementation, the proxy object will be assigned to `x`, but it will keep referring to `i`. To prevent such problems, rename start() to start_ref() and end() to end_ref(). This forces us to audit all calls, and redirect calls that will break to new start_copy() and end_copy() methods.	2025-06-14 21:26:16 +03:00
Avi Kivity	674118fd2e	test: wrapping_interval_test: add more tests for intervals In this series, we will make interval manage its memory directly, specifically it will directly construct and destroy T values that it contains rather than let std::optional<T> manage those values itself. Add tests that expose bugs encountered during development (actually, review) of this series. The tests pass before the series, fail with series as it was before fixing, and pass with the series as it is now. The tests use a class maybe_throwing_interval_payload that can be set to throw at strategic locations and exercise all the interesting interval shapes.	2025-06-14 21:26:14 +03:00
Patryk Jędrzejczak	c4cf95aeb3	Merge 'raft: simplify voter handler code to not pass node references around' from Emil Maskovsky Refactor the voter handler logic to only pass around node IDs (`raft::server_id`), instead of pairs of IDs and node descriptor references. Node descriptors can always be efficiently retrieved from the original nodes map, which remains valid throughout the calculation. This change reduces unnecessary reference passing and simplifies the code. All node detail lookups are now performed via the central nodes map as needed. Additional cleanup has been done: * removing redundant comments (that just repeat what the code does) * use explicit comparators for the datacenter and rack information priorities (instead of the comparison operator) to be more explicit about the prioritization Fixes: scylladb/scylladb#24035 No backport: This change does not fix any bug and doesn't change the behavior, just cleans up the code in master, therefore no backport is needed. Closes scylladb/scylladb#24452 * https://github.com/scylladb/scylladb: raft: simplify voter handler code to not pass node references around raft: reformat voter handler for consistent indentation raft: use explicit priority comparators for datacenters and racks raft: clean up voter handler by removing redundant comments	2025-06-13 19:02:07 +02:00
Anna Stuchlik	e2b7302183	doc: extend 2025.2 upgrade with a note about consistent topology updates This commit adds a note that the user should enable consistent topology updates before upgrading to 2025.2 if they didn't do it (for some reason) when previously upgrading to version 2025.1. Fixes https://github.com/scylladb/scylladb/issues/24467 Closes scylladb/scylladb#24468	2025-06-13 13:54:59 +03:00
Piotr Dulikowski	238fc24800	Merge 'test: dtest: move audit_test.py to test.py' from Andrzej Jackowski Copied the entire audit_test.py from scylladb/scylla-dtest, to remove the entire file from scylla-dtest after this patch series is merged. The motivation is to move all audit testing out of dtests, to make it easier to maintain and more reliable. After audit_test.py was moved from dtests to test.py, some issues that require fixing arose due to differences between the frameworks. No backport, moving audit_test.py to test.py is a new testing effort. Closes scylladb/scylladb#24231 * github.com:scylladb/scylladb: test: audit: filter out LOGIN and USE audit logs test: audit: remove require mark test: audit: wait until raft state is applied in test_permissions test: audit: fix problems in audit_test.py test: dtest: add dict support to populate in scylla_cluster.py test: dtest: copied get_node_ip from dtests to scylla_cluster.py test: dtest: copy run_rest_api from dtests to cluster.py test: dtest: copy run_in_parallel from dtests to data.py test: audit: copy unmodified audit_test.py from dtests	2025-06-12 09:03:45 +02:00
Andrei Chekun	570aaa2ecb	test.py: clean code that isn't used anymore Clean code that is not used anymore	2025-06-11 18:29:26 +02:00
Andrei Chekun	9dca7719b1	test.py: switch off C++ tests from test.py discovery Switch off C++ tests from test.py discovery. With this change, test.py loses the ability to directly see and run the C++ tests. Instead, it delegates everything to pytest. Since boost, raft, unit, and ldap directories aren't executed by test.py, suite.yaml files are renamed to test_config.yaml to preserve the old way of test configuration while removing them from execution by test.py. Before this patch, boost tests were visible to both test.py and pytest. So if test.py was invoked without a test name, it executed boost tests twice: once with the test.py executor and once with the pytest executor. The executor was chosen according to the test name. For example, test/boost/aggregate_fcts_test.cc was executed by pytest, while boost/aggregate_fcts_test was executed by the test.py executor.	2025-06-11 18:29:26 +02:00
Andrei Chekun	42d9dbe66a	test.py: Integrate pytest c++ test execution to test.py With current changes pytest executes boost tests. Gathering metrics added to the pytest BoostFacade and UnitFacade to have the possibility to get them for C++ test as previously. Since pytest executes all modes by itself JUnit report for the C++ test will be one for the run. That means that there is no possibility to output them in testlog in different folders. So testlog/report directory is used to store all kinds of reports generated during tests. JUnit reports should be testlog/report/junit, Allure reports should be in testlog/report/allure. Breaking changes: 1. Terminal output changed. test.py will run pytest for next directories: test/boost, test/ldap, test/raft, test/unit. test.py will blindly translate the output of the pytest to the terminal. Then when all these tests are finished, test.py will continue to show previous output for the rest of the test. 2. The format of execution of C++ test directories mentioned above has been changed. Now it will be a simple path to the file with extension. For example, instead of boost/aggregate_fcts_test now you need to use test/boost/aggregate_fcts_test.cc 3. This PR creates a spike in test amount. The previous logic was to consolidate the boost results from different runs and different modes to one report. So for the three repeats and three modes (nine test results) in CI was shown one result. Now it shows nine results with differentiating them by mode and run. Note: Pytest uses pytest-xdist module to run tests in parallel. Frozen toolchain has this dependency installed, for the local use, please install it manually.	2025-06-11 18:29:23 +02:00
Tomasz Grabiec	eabc1fa6ff	Merge 'tablets: deallocate storage state on end_migration' from Michael Litvak When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. the tablet is cleaned up successfully 3. The leaving replica is restarted, and allocates storage group 4. tablet cleanup is not called because it's already cleaned up 5. the storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. 
Fixes https://github.com/scylladb/scylladb/issues/23481 backport to all relevant releases since it's a bug that results in a crash Closes scylladb/scylladb#24393 * github.com:scylladb/scylladb: test/cluster/test_tablets: test restart during tablet cleanup test: tablets: add get_tablet_info helper tablets: deallocate storage state on end_migration	2025-06-11 17:37:02 +02:00
Aleksandra Martyniuk	83c9af9670	test: add test for repair and resize finalization Add a test that checks that repair does not start while a resize finalization is ongoing.	2025-06-11 16:17:39 +02:00
Aleksandra Martyniuk	df152d9824	repair: postpone repair until topology is not busy Currently, repair_service::repair_tablets starts repair if there are no ongoing tablet operations. The check does not consider global topology operations, like tablet resize finalization. This may cause a data race and unexpected behavior. Start repair only when the topology is not busy.	2025-06-11 15:38:43 +02:00
Gleb Natapov	c00a0554e0	topology coordinator: simplify truncate handling in case request queue feature is disabled After allowing multiple commands to run in parallel, the code that handles multiple truncates to the same table can be simplified: it is now executed only if the request queue feature is disabled, so it does not need to handle the case where a request may be in the queue.	2025-06-11 11:29:33 +03:00
Gleb Natapov	01dd4b7f30	topology coordinator: fix indentation after the previous patch	2025-06-11 11:29:33 +03:00
Gleb Natapov	a9e99d1d3c	topology coordinator: allow running multiple global commands in parallel Now that we have a global request queue, do not check whether a global request already exists before adding another one. Amend the truncation test that expects this explicitly, and add another one that checks that two truncates can be submitted in parallel.	2025-06-11 11:29:33 +03:00
Gleb Natapov	a0a3a034e0	topology coordinator: Implement global topology request queue Requests, together with their parameters, are added to the topology_request tables and the queue of active global requests is kept in topology state. They are processed one by one by the topology state machine. Fixes: #16822	2025-06-11 11:29:33 +03:00
Andrzej Jackowski	e23d79cb62	test: audit: filter out LOGIN and USE audit logs LOGIN entries can appear at many points during testing, for example, when a driver creates a new session. Similarly, `USE ks` statements can appear unexpectedly, especially when the python-driver calls `set_keyspace_async` for new connections. To avoid test checks failures, this commit filters out LOGIN and USE entries in tests that are not intended to verify these two types of audit logs.	2025-06-11 09:43:51 +02:00
Andrzej Jackowski	876eaf459b	test: audit: remove require mark After moving audit tests from dtests to test.py, require marks are no longer needed because the tests and the code are in the same repository.	2025-06-11 09:43:51 +02:00
Marcin Maliszkiewicz	111cccf8ba	test: audit: wait until raft state is applied in test_permissions Otherwise the test is flaky, expecting permissions to be enforced before they get applied.	2025-06-11 09:43:51 +02:00
Andrzej Jackowski	6c6234979c	test: audit: fix problems in audit_test.py After audit_test.py was moved from dtests to test.py, the following issues arose due to differences between the frameworks: - Some imports were unnecessary or broken - The @pytest.mark.dtest_full decorator was no longer needed - The `issue_open` attribute in `xmark` is not supported - Support for sending SIGHUP is encapsulated by `server_update_config` in test.py - A workaround for scylladb#24473 was required Moreover, suite.yaml was changed to start running audit_test.py in dev mode. Ref. scylladb#24473 Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-11 09:43:44 +02:00
Michał Chojnowski	0ade15df33	transport/server: silence the oversized allocation warning in snappy_compress It has been observed to generate ~200 kiB allocations. Since we have already been made aware of that, we can silence the warning to clean up the logs. Closes scylladb/scylladb#24360	2025-06-10 19:13:26 +03:00
Petr Gusev	b1050944a3	storage_service: test_group0_apply_while_node_is_being_shutdown	2025-06-10 17:25:03 +02:00
Petr Gusev	6b85ab79d6	main.cc: fix group0 shutdown order group0 persistence relies on local storage, so before shutting down local storage we need to make sure that group0 is stopped. Fixes scylladb/scylladb#24401	2025-06-10 16:06:22 +02:00
Wojciech Mitros	5eb4466789	Return correct creation date time in describe table Add a system:table_creation_time tag whose value is the table's creation timestamp in milliseconds. If the tag is present, it is used to fill the creation timestamp value (when CreateTable or DescribeTable is called). If the tag is missing, a timestamp of 0 is substituted (in other words, the table is reported as created on 1 January 1970). Update the test to change how we make sure the timestamp is actually used: we create two tables one after another and make sure their creation timestamps are in the correct order. Update tests that work with tags to filter system tags out. Fixes #5013 Closes scylladb/scylladb#24007	2025-06-10 15:25:57 +03:00
Nadav Har'El	ed3a0a81d6	test/cqlpy: add some more tests of secondary index system tables This patch adds a couple of basic tests for system tables related to secondary indexes - system."IndexInfo" and system_schema.indexes. I wanted to understand these system tables better when writing documentation for them - so wrote these tests. These tests can also serve as regression tests that verify that we don't accidentally lose support for these system tables. I checked that these tests also pass in Cassandra 3, 4 and 5. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24137	2025-06-10 15:00:51 +03:00
Tomasz Grabiec	0b516da95b	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become a generic interface for applying mutations in raft. Pulling `database::apply()` out of schema merging code will allow batching changes to subsystems. Future generic code will first call `prepare()` on all implementations, then a single `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Closes scylladb/scylladb#20853 * github.com:scylladb/scylladb: storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during schema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-06-10 13:45:32 +02:00
Ernest Zaslavsky	30199552ac	s3_client: Mitigate connection exhaustion in `download_source` The existing `download_source` implementation optimizes performance by keeping the connection to S3 open and draining data directly from the socket. While this eliminates the overhead (60-100ms) of repeatedly establishing new connections, it leads to rapid exhaustion of client- side connections. On a single shard, two `mx_readers` for load and stream are enough to trigger this issue. Since each client typically holds two connections, readers keeping index and data sources open can cause deadlocks where processes stall due to unavailable connections. Introduce `chunked_download_source`, a new S3 download method built on `download_source`, to dynamically manage connections: - Buffers data in 5MiB chunks using a producer-consumer model - Closes connections once buffers reach capacity, returning them to the pool for other clients - Uses a filling fiber that resumes fetching once buffers are consumed from the queue Performance remains comparable to `download_source`, achieving 95MiB/s for sequential 1GiB downloads from S3. However, preloading large chunks may cause read amplification. Fixes: https://github.com/scylladb/scylladb/issues/23785 Closes scylladb/scylladb#23880	2025-06-10 12:58:24 +03:00
Anna Stuchlik	b0ced64c88	doc: remove the limitation for disabling CDC This commit removes the instruction to stop all writes before disabling CDC with ALTER. Fixes https://github.com/scylladb/scylla-docs/issues/4020 Closes scylladb/scylladb#24406	2025-06-10 12:53:09 +03:00
Robert Bindar	ca1a9c8d01	Add support for nodetool refresh --skip-reshape This patch adds the new option in nodetool, patches the load_new_ss_tables REST request with a new parameter and skips the reshape step in refresh if this flag is passed. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#24409 Fixes: #24365	2025-06-10 12:52:13 +03:00
David Garcia	62fdebfe78	chore: exclude OS and ENT from google Closes scylladb/scylladb#24417	2025-06-10 12:50:37 +03:00
Emil Maskovsky	b7e0a01fcc	raft: simplify voter handler code to not pass node references around Refactor the voter handler logic to only pass around node IDs (`raft::server_id`), instead of pairs of IDs and node descriptor references. Node descriptors can always be efficiently retrieved from the original nodes map, which remains valid throughout the calculation. This change reduces unnecessary reference passing and simplifies the code. All node detail lookups are now performed via the central nodes map as needed. Fixes: scylladb/scylladb#24035	2025-06-10 11:04:56 +02:00
Emil Maskovsky	839e0bf40d	raft: reformat voter handler for consistent indentation Reformatted the voter handler implementation to comply with clang-format automatic formatting rules. No functional changes.	2025-06-10 11:04:56 +02:00
Emil Maskovsky	05392e6ef3	raft: use explicit priority comparators for datacenters and racks Refactor the voter handler to use explicit priority comparator classes for datacenter and rack selection. This makes the prioritization logic more transparent and robust, and reduces the risk of subtle bugs that could arise from relying on implicit comparison operators.	2025-06-10 11:04:54 +02:00
Emil Maskovsky	e93bf3f05a	raft: clean up voter handler by removing redundant comments Remove comments from the group0 voter handler that simply restate the code or do not provide meaningful clarification. This improves code readability and maintainability by reducing noise and focusing on essential documentation.	2025-06-10 11:03:20 +02:00
Calle Wilund	80feb8b676	utils::http::dns_connection_factory: Use a shared certificate_credentials Fixes #24447 This factory type, which is really more a data holder/connection producer per connection instance, creates, if using https, a new certificate_credentials on every instance. Which when used by S3 client is per client and scheduling groups. Which eventually means that we will do a set_system_trust + "cold" handshake for every tls connection created this way. This will cause both IO and cold/expensive certificate checking -> possible stalls/wasted CPU. Since the credentials object in question is literally a "just trust system", it could very well be shared across the shard. This PR adds a thread local static cached credentials object and uses this instead. Could consider moving this to seastar, but maybe this is too much. Closes scylladb/scylladb#24448	2025-06-10 11:20:21 +03:00
Petr Gusev	e456d2d507	storage_proxy: log gate_closed_exception gate_closed_exception likely signals that we have shutdown order issues. If we just swallow it we lose information what exact component was shutdown prematurely. For example, we stopped local storage before group0 during shutdown in main.cc. If a group0 command arrives, topology_state_load might try to write something and get mutation_write_failure_exception, which results in 'applier fiber stopped because of the error'. There is no other information in the logs in this case, other than 'mutation_write_failure_exception'. It's not clear what the original problem is and what component is triggering it. In this commit we add a warning to the logs when gate_closed_exception is thrown from lmutate or rmutate. Another option is to just remove the try_catch_nested line and allow gate_closed_exception to be logged as an error below. However, this might break some tests which check ERROR lines in the logs.	2025-06-10 10:04:04 +02:00
Andrzej Jackowski	c4e8a2c44e	mapreduce: change next_vnode lambda to get_next_partition_range function The motivation of this code reorganization is to shorten the time when ERM is being kept, done later in this patch series. Ref. scylladb#21831	2025-06-10 09:06:17 +02:00
Michael Litvak	bd88ca92c8	test/cluster/test_tablets: test restart during tablet cleanup Add a test that reproduces issue scylladb/scylladb#23481. The test migrates a tablet from one node to another, and while the tablet is in some stage of cleanup - either before or right after, depending on the parameter - the leaving replica, on which the tablet is cleaned, is restarted. This is interesting because when the leaving replica starts and loads its state, the tablet could be in different stages of cleanup - the SSTables may still exist or they may have been cleaned up already, and we want to make sure the state is loaded correctly.	2025-06-09 17:27:45 +03:00
Michael Litvak	fb18fc0505	test: tablets: add get_tablet_info helper Add a helper for tests to get the tablet info from system.tablets for a tablet owning a given token.	2025-06-09 16:59:07 +03:00
Michael Litvak	34f15ca871	tablets: deallocate storage state on end_migration When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. The tablet is cleaned up successfully 3. The leaving replica is restarted, and allocates storage group 4. Tablet cleanup is not called because it was already cleaned up 5. The storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes scylladb/scylladb#23481	2025-06-09 16:58:38 +03:00
Michael Litvak	8aeb404893	test_cdc_generation_clearing: wait for generations to propagate In test_cdc_generation_clearing we trigger events that update CDC generations, verify the generations are updated as expected, and verify the system topology and CDC generations are consistent on all nodes. Before checking that all nodes are consistent and have the same CDC generations, we need to consider that the changes are propagated through raft and take some time to propagate to all nodes. Currently, we wait for the change to be applied only on the first server which runs the CDC generation publisher fiber and read the CDC generations from this single node. The consistency check that follows could fail if the change was not propagated to some other node yet. To fix that, before checking consistency with all nodes, we execute a read barrier on all nodes so they all see the same state as the leader. Fixes scylladb/scylladb#24407 Closes scylladb/scylladb#24433	2025-06-09 12:59:04 +02:00
Gleb Natapov	bb29591daf	topology coordinator: Do not cancel global requests in cancel_all_requests This was mistakenly added by `fbd75c5c06`. The function is called after checking that no topology request can proceed, so it cancels them, but this has nothing to do with global request. Also, for some reason, the cancellation was added in the loop over topology requests.	2025-06-09 13:38:49 +03:00
Gleb Natapov	be0b328b19	topology coordinator: store request type for each global command	2025-06-09 13:38:49 +03:00
Gleb Natapov	00fd427be0	topology request: make it possible to hold global request types in request_type field topology_request table has a field to hold a request type, but currently it can hold only per node requests. This patch makes it possible to store global request types there as well.	2025-06-09 13:38:49 +03:00
Gleb Natapov	3a496067c6	topology coordinator: move alter table global request parameters into topology_request table Currently parameters to alter table global topology command are stored in static column in the topology table, but this way there can be only one outstanding alter table request. This patch moves the parameters to the topology_request table where parameters are stored per request.	2025-06-09 13:38:49 +03:00
Gleb Natapov	a9244bf037	topology coordinator: move cleanup global command to report completion through topology_request table We want to unify all commands to report completion through the topology_requests table.	2025-06-09 13:38:49 +03:00
Gleb Natapov	6a52ba2251	topology coordinator: no need to create updates vector explicitly	2025-06-09 13:38:49 +03:00
Gleb Natapov	69dacb5894	topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it	2025-06-09 13:38:49 +03:00
Gleb Natapov	7257391c8f	topology coordinator: handle error during new_cdc_generation command processing Currently if there is an error during new_cdc_generation command it is retried in a loop. Since the status of the command execution is now reported through the topology request table we can fail the command instead.	2025-06-09 13:38:48 +03:00
Gleb Natapov	389f0f6280	topology coordinator: remove unneeded semicolon	2025-06-09 13:38:48 +03:00
Gleb Natapov	ba371c09fc	topology coordinator: fix indentation after the last commit	2025-06-09 13:38:48 +03:00
Gleb Natapov	b8c11f330a	topology coordinator: move new_cdc_generation topology request to use topology_request table for completion Currently it checks the completion by waiting for new generation to appear, but we want to unify all commands to check for completion in topology_request table.	2025-06-09 13:38:48 +03:00
Gleb Natapov	6d09c76a12	gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag Will be needed to coordinate between old and new nodes during upgrade.	2025-06-09 13:38:48 +03:00
Anna Stuchlik	93a7146250	doc: add redirections to fix 404 This commit adds redirections for pages on the master branch that were unexpectedly indexed by Google. Those pages no longer exist and return 404. Fixes https://github.com/scylladb/scylladb/issues/24397 Closes scylladb/scylladb#24422	2025-06-09 12:38:10 +02:00
Pavel Emelyanov	46557b3927	table: Touch and sync snapshot directory only once The table::take_snapshot() touches the snapshot directory, which is good. It happens on all shards, which is not that good, because all shards just step on each other's toes when doing it, the directory is not sharded. Same for post-snapshot directory sync -- it can happen once, after all shards finish creating snapshot links. Move both, touching and syncing, up one level. There's only one caller of the method, so only one caller to update. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24154	2025-06-09 13:36:49 +03:00
Michał Chojnowski	7d26d3c7cb	db/config: add an option that disables dict-aware sstable compressors in DDL statements For reasons, we want to be able to disallow dictionary-aware compressors in chosen deployments. This patch adds a knob for that. When the knob is disabled, dictionary-aware compressors will be rejected in the validation stage of CREATE and ALTER statements. Closes scylladb/scylladb#24355	2025-06-09 13:30:40 +03:00
Raphael S. Carvalho	2d716f3ffe	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() get RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that a sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second from being filtered out. It's not perfect but extremely unlikely a new write lands and get flushed in the same millisecond we recorded truncated_at timepoint. In practice, truncate will not be used concurrently to writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, which frequency of creation is low enough for not having significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426	2025-06-08 15:59:15 +03:00
Nadav Har'El	a714079a62	Merge 'Add Support for Per-Table Metrics in Alternator' from Amnon Heiman This series introduces per-table metrics support for Alternator. It includes the following commits: Add optional per-table metrics for Alternator Introduces a shared_ptr-based mechanism that allows Alternator to register per-table metrics. These metrics follow the table's lifecycle, similar to how CQL metrics are handled. The use of shared_ptr ensures no direct dependency between table stats and Alternator. Enable registration of stats objects per table Adds support for registering a stats object using a keyspace and table name. Per-table metrics are prefixed with alternator_table to differentiate them from per-shard metrics. Metrics are reported once per node, and those not meaningful at the table level (e.g. create/delete) are excluded. All metrics use the skip_when_empty flag. Update per-table metrics handling Adds a helper function to retrieve the stats object from a table schema. Updates both per-shard and per-table metrics, resulting in some code duplication. Add tests for per-table metrics Extends existing tests to also validate the per-table metrics. These tests ensure that the new metrics are correctly registered and updated. This series improves observability in Alternator by enabling fine-grained per-table metrics without disrupting existing per-shard metrics. No need to backport Fixes #19824 Closes scylladb/scylladb#24046 * github.com:scylladb/scylladb: alternator/test_metrics.py: Test the per-table metrics alternator/executor.cc: Update per-table metrics alternator/stats: Add per-table metrics replica/database.hh: Add alternator per-table metrics alternator/stats.hh: Introduce a per-table stats container	2025-06-08 10:42:05 +03:00
Botond Dénes	8498bd6376	Merge 'Replace container_to_vec with std::ranges' from Pavel Emelyanov The helper in question converts an iterable collection to a vector of fmt::to_string()-s of the collection elements. Patch the caller to use standard library and remove the helper. Closes scylladb/scylladb#24357 * github.com:scylladb/scylladb: api: Drop no longer used container_to_vec helper api: Use std::ranges to stringify collections api: Use std::ranges to convert std::set<sstring> to std::vector<string> api: Use db::config::data_file_directories()' vector directly api: Coroutinize get_live_endpoint()	2025-06-06 10:57:06 +03:00
Pavel Emelyanov	12420dc644	api: Shorten get_host_to_id_map() handler The handler does - gets host IDs from local token metadata - for each ID gets the host IP and generates IP:ID std::pair - converts the sequence of generated pairs into std::unordered_map - converts the unordered map into vector of jsonable key:value objects This patch removes the 3rd step and makes the needed jsonable object in step 2 directly, thus eliminating the interposing unordered_map creation. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24354	2025-06-06 10:54:23 +03:00
Pavel Emelyanov	428edd41f5	api: Make use of database::get_all_keyspaces() There are two places in the API that want to get the list of keyspace names. For that they call database::get_keyspaces() and then extract keys from the returned name to class keyspace map. There's a database::get_all_keyspaces() method that does exactly that. Remove the map_keys helper from the api/api.hh that becomes unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24353	2025-06-06 10:53:09 +03:00
Marcin Maliszkiewicz	2090e44283	storage_service: always wake up load balancer on update tablet metadata Lack of wakeup is error-prone, as it relies on a wakeup occurring elsewhere.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	ddc0656eb5	db: schema_applier: call destroy also when exception occurs Otherwise objects may be destroyed on wrong shard, and assert will trigger in ~sharded().	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	547bb1f663	db: replica: simplify seeding ERM during schema change We know that caller is running on shard 0 so we can avoid some extra boilerplate.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	97cdb72d4d	db: remove cleanup from add_column_family Since we abort now on failure during schema commit there is no need for cleanup as it only manages in-memory state. Explicit cf.stop was added to code paths outside of schema merging to avoid unnecessary regressions.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	d5075c70ef	db: abort on exception during schema commit phase As we have no way to recover from partial commit.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	858db822dc	db: make user defined types changes atomic The same order of creation/destruction is preserved as in the original code, looking from single shard point of view. create_types() is called on each shard separately, while in theory we should be able reuse results similarly as diff_rows(). But we don't introduce this optimization yet.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	5b2e4140cc	replica: db: make keyspace schema changes atomic Now all keyspace related schema changes are observable on given shard as they would be applied atomically. This is achieved by commit_on_shard() function being non-preemptive (no futures, no co_awaits). In the future we'll extend this to the whole schema and also other subsystems.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	556e89bc9d	db: atomically apply changes to tables and views In this commit we make use of split functions introduced before. Pattern is as follows: - in merge_tables_and_views we call some preparatory functions - in schema_applier::update we call non-yielding step - in schema_applier::post_commit we call cleanups and other finalizing async functions Additionally we introduce frozen_schema_diff because converting schema_ptr to global_schema_ptr triggers schema registration and with atomic changes we need to place registration only in commit phase. Schema freezing is the same method global_schema_ptr uses to transport schema across shards (via schema_registry cache).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	a27776b4ff	replica: make truncate_table_on_all_shards get whole schema from table_shards Before for views and indexes it was fetching base schema from db (and couple other properties). This is a problem once we introduce atomic tables and views deletion (in the following commit). Because once we delete table it can no longer be fetched from db object, and truncation is performed after atomically deleting all relevant tables/views/indexes. Now the whole relevant schema will be fetched via global_table_ptr (table_shards) object.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	ac254e9722	service: split update_tablet_metadata into two phases In following commits calls will be split in schema_applier.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	21a5a3c01f	service: pull out update_tablet_metadata from migration_listener It's not a good usage as there is only one non-empty implementation. Also we need to change it further in the following commit which makes it incompatible with listener code.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	92e3d69f79	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	1c8fd3a65d	service: simplify load_tablet_metadata and update_tablet_metadata - remove load_tablet_metadata(), instead we add wake_up_load_balancer flag to update_tablet_metadata(), it reduces number of public functions and also serves as a comment (removed comment with very similar meaning) - reimplement the code to not use mutate_token_metadata(), this way it's more readable and it's also needed as we'll split update_tablet_metadata() in following commits so that we can have subroutine which doesn't yield (for ensuring atomicity)	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	3119a02edd	db: don't perform move on tablet_hint reference This lambda is called several times so there should be no move. Currently the bug likely doesn't manifest as code does work only on shard 0.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	1ad14f02f1	replica: split add_column_family_and_make_directory into steps This is similar work as for drop_table in previous commit. add_column_family_and_make_directory() behaves exactly the same as before but calls to it in schema_applier will be replaced by calls directly to split steps. Other usages will remain intact as they don't need atomicity (like creating system tables at startup).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	141a5643e5	replica: db: split drop_table into steps This is done so that actual dropping can be an atomic step which could be composed with other schema operations, and eventually all subsystems modified via raft so that we could introduce atomic changes which span across different subsystems. We split drop_table_on_all_shards() into: - prepare_tables_metadata_change_on_all_shards() - prepare_drop_table_on_all_shards() - drop_table() - cleanup_drop_table_on_all_shards() prepare_tables_metadata_change_on_all_shards() is necessary because when applying multiple schema changes at once (e.g. drop and add tables) we need to lock only once. We add legacy_drop_table_on_all_shards() which behaves exactly like old drop_table_on_all_shards() to be compatible with code which doesn't need to play with atomicity. Usages of legacy_drop_table_on_all_shards() in schema_applier will be replaced with direct calls to split functions in the following commits - that's the place we will take advantage of drop_table not yielding (as it returns void now).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	2bae38e252	db: don't move map references in merge_tables_and_views() Since they are const it's not needed and misleading.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	85f19e165a	db: introduce commit_on_shard function This will be the place for all atomic schema switching operations. Note that atomicity is observed only from single shard point of view. All shards may switch at slightly different times as global locking for this is not feasible.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	b3730282c3	db: access types during schema merge via special storage Once we create types atomically the code which is before commit may depend on newly added types, so it has to access both old and new types. New storage called in_progress_types_storage was added.	2025-06-06 08:50:33 +02:00
Pavel Emelyanov	f5743c6afc	Merge 'test/alternator: make tests runnable on DynamoDB Local' from Nadav Har'El The Alternator tests should pass on Alternator (of course), and almost always also on DynamoDB to verify that the tests themselves are correct and don't just enshrine Alternator's incorrect behavior. Although much less important, it is sometimes useful to be able to check if the tests also pass on other DynamoDB clones, especially "DynamoDB Local" - Amazon's DynamoDB mock written in Java. In issue https://github.com/scylladb/scylladb/issues/7775 we noted that some of our tests don't actually pass on DynamoDB Local, for different reasons, but at the time that issue was created most of the tests did work. However, checking now on a newer version of DynamoDB Local (2.6.1), I notice that _all_ tests failed because of some silly reasons that are easy to fix - and this is what the two patches in this series fix. After these fixes, most of the Alternator tests pass on DynamoDB Local. But not all of them - #7775 is still open. No backport needed - these are just test framework improvements for developers. Closes scylladb/scylladb#24361 * github.com:scylladb/scylladb: test/alternator: any response from healthcheck means server is alive test/alternator: fall back to legal-looking access key id	2025-06-06 08:50:58 +03:00
Nadav Har'El	b0f98f7d4b	mv: test that view's SELECT automatically includes primary key Both ScyllaDB's and Datastax's documentation suggest that when creating a view with CREATE MATERIALIZED VIEW, its SELECT clause doesn't need to list the view's primary key columns because those are selected automatically. For example, our documentation has an example in https://docs.scylladb.com/manual/stable/features/materialized-views.html ``` CREATE MATERIALIZED VIEW building_by_city2 AS SELECT meters FROM buildings WHERE city IS NOT NULL PRIMARY KEY(city, name); ``` Note how the primary key columns - city and name - are not explicitly SELECTed. I just discovered that while this behavior was indeed true in Cassandra 3 (and still true in ScyllaDB), it actually got broken in Cassandra 4 and 5. I reported this apparent regression to Cassandra (CASSANDRA-20701), and am proposing the regression test in this patch to ensure that Scylla can't suffer a similar regression in the future. The new test passes on ScyllaDB and Cassandra 3, but fails on Cassandra 4 and 5 (and therefore tagged with "cassandra_bug"). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24399	2025-06-05 16:52:49 +02:00
Piotr Szymaniak	de96c28625	alternator: Add support for TTL when using tablets Support for TTL-based data removal when using tablets. The essence of this commit is a separate code path for finding token ranges owned by the current shard for the cases when tablets are used and not vnodes. At the same time, the vnodes case is not touched, so as not to cause any regressions. The TTL-caused data removal is normally performed by the primary replica (both when using vnodes and tablets). For the tablets case, the already-existing method tablet_map::get_primary_replica(tablet_id) is used to know if a shard executing the TTL-related data removal is the primary replica for each tablet. A new method tablet_map::get_secondary_replica(tablet_id) has been added. It is needed by the data invalidation procedure to remove data when the primary replica node is down - the data is then removed by the secondary replica node. The mechanism is the same as in the vnodes case. Since alternator now supports TTL, the test `test_ttl_enable_error_with_tablets` has been removed. Also, tests in the test_ttl.py have been made to run twice, once with vnodes and once with tablets. When run with tablets, due to the lack of support for LWT with tablets (#18068), the tests use 'system:write_isolation' of 'unsafe_rmw'. This approach allows early regression testing with tablets and is meant only as a tentative solution. Fixes scylladb/scylladb#16567 Closes scylladb/scylladb#23662	2025-06-05 17:39:29 +03:00
Amnon Heiman	760c8c3333	alternator/test_metrics.py: Test the per-table metrics This patch adds tests for the newly added per-table metrics. It mainly redoes existing tests, but verifies that the per-table metrics are updated correctly. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 15:12:19 +03:00
Amnon Heiman	3ad7a24eee	alternator/executor.cc: Update per-table metrics This patch adds support for updating per-table metrics. It introduces a helper function that retrieves the stats object from a table schema. The code uses a lw_shared_ptr for the stats object to ensure safe updates even if the table holding it has been deleted. There is some duplication in the updated code, as both per-shard and per-table metrics are updated. The rmw_operation::execute function now accepts two stats objects: one for the global metrics and one for the per-table metrics. The use of execute was also modified—rather than modifying the WCU directly, a parameter is used so both global and per-table stats can be updated. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 15:12:13 +03:00
Amnon Heiman	d6afd42342	alternator/stats: Add per-table metrics This patch allows registering a stats object per table. The per-table stats object needs its metrics registry to be part of the table's lifecycle, but there could be a scenario in which a table is already deleted while some Alternator operations are still in progress. To handle this, the patch separates the registry from the metrics holder. It is safe to modify a parameter that is not registered. Metrics registration is performed via functions instead of the constructor. The registration accepts a keyspace and table name as parameters. The per-table metrics use an alternator_table prefix to distinguish them from their per-shard equivalents. The metrics are aggregated and reported once per node. Metrics that do not make sense to report per table (such as create and delete) are not registered. All metrics are marked with skip_when_empty. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 14:44:03 +03:00
Amnon Heiman	005df0c5c4	replica/database.hh: Add alternator per-table metrics This patch adds optional per-table metrics for Alternator. Like CQL, some of Alternator's statistics should be per-table. The shared_ptr allows Alternator to register such metrics in a way that makes them part of the table's lifecycle. Using a shared_ptr does not create dependencies between the table_stats and Alternator. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 14:38:14 +03:00
Amnon Heiman	af262317b5	alternator/stats.hh: Introduce a per-table stats container A per-table stats container will be used to safely hold alternator per-table stats. It is built in a way that even if the metrics it holds are no longer registered, it is still safe to use. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 14:38:14 +03:00
Andrzej Jackowski	e6eb741e95	test: dtest: add dict support to populate in scylla_cluster.py Co-authored-by: Evgeniy Naydanov <evgeniy.naydanov@scylladb.com>	2025-06-05 08:20:09 +02:00
Andrzej Jackowski	e3f052d6fb	test: dtest: copied get_node_ip from dtests to scylla_cluster.py Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:20:09 +02:00
Andrzej Jackowski	40e71ad1e6	test: dtest: copy run_rest_api from dtests to cluster.py Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:20:09 +02:00
Andrzej Jackowski	3da86f04a5	test: dtest: copy run_in_parallel from dtests to data.py Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:19:54 +02:00
Andrzej Jackowski	a1b1d810f9	test: audit: copy unmodified audit_test.py from dtests Copied the entire audit_test.py from scylladb/scylla-dtest, to remove the entire file from scylla-dtest after this patch series is merged. The motivation is to move all audit testing out of dtests, to make it easier to maintain and more reliable. Changed suite.yaml to prevent audit_test.py from running, because audit_test.py needs improvement before it starts passing. Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:19:44 +02:00
Ernest Zaslavsky	a39b773d36	encryption_test: Catch exact exception Apparently `test_kms_network_error` will succeed under any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it throws, the test will be marked as passed. Start catching the exact exception that we expect to be thrown. Maybe somewhat related to https://github.com/scylladb/scylladb/issues/22628 Fixes: https://github.com/scylladb/scylladb/issues/24145 reapplies reverted: https://github.com/scylladb/scylladb/pull/24065 Should be backported to 2025.2. Closes scylladb/scylladb#24242	2025-06-05 08:32:51 +03:00
Benny Halevy	8b387109fc	disk_space_monitor: add space_source_registration Register the current space_source_fn in an RAII object that resets monitor._space_source to the previous function when the RAII object is destroyed. Use space_source_registration in database_test:: mutation_dump_generated_schema_deterministic_id_version to prevent use-after-stack-return in the test. Fixes #24314 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24342	2025-06-04 16:25:24 +03:00
Ernest Zaslavsky	1446f57635	minio: update CLI usage, remove deprecated `mc` options Replace phased-out `mc` command options with supported alternatives. Ensures compatibility with the latest MinIO version. Closes scylladb/scylladb#24363	2025-06-04 16:22:48 +03:00
Anna Stuchlik	8b989d7fb1	doc: add the upgrade guide from 2025.1 to 2025.2 This commit adds the upgrade guide from version 2025.1 to 2025.2. Also, it removes the upgrade guides existing for the previous version that are irrelevant in 2025.2 (upgrade from OSS 6.2 and Enterprise 2024.x). Note that the new guide does not include the "Enable Consistent Topology Updates" page, as users upgrading to 2025.2 have consistent topology updates already enabled. Fixes https://github.com/scylladb/scylladb/issues/24133 Fixes https://github.com/scylladb/scylladb/issues/24265 Closes scylladb/scylladb#24266	2025-06-04 14:00:05 +03:00
Szymon Malewski	5969809607	mapreduce_service: Prevent race condition In parallelized aggregation functions, the super-coordinator (the node performing the final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`). Usually responses are spread over time and the actual merging is atomic. However, sometimes partial results are received at a similar time, and if an aggregate function (e.g. a lua script) yields, two coroutines can try to overwrite the same accumulator one after another, which leads to losing some of the results. To prevent this, in this patch each coroutine stores its merging results in its own context and overwrites the accumulator atomically, only after it was fully merged. Compared to the previous implementation, the order of operands in the merging function is swapped, but the order of aggregation is not guaranteed anyway. Fixes #20662 Closes scylladb/scylladb#24106	2025-06-04 13:47:11 +03:00
Nadav Har'El	6cbcabd100	alternator: hide internal tags from users The "tags" mechanism in Alternator is a convenient way to attach metadata to Alternator tables. Recently we have started using it more and more for internal metadata storage: * UpdateTimeToLive stores the attribute in a tag system:ttl_attribute * CreateTable stores provisioned throughput in tags system:provisioned_rcu and system:provisioned_wcu * CreateTable stores the table's creation time in a tag called system:table_creation_time. We do not want any of these internal tags to be visible to a ListTagsOfResource request, because if they are visible (as before this patch), systems such as Terraform can get confused when they suddenly see a tag which they didn't set - and may even attempt to delete it (as reported in issue #24098). Moreover, we don't want any of these internal tags to be writable with TagResource or UntagResource: If a user wants to change the TTL setting they should do it via UpdateTimeToLive - not by writing directly to tags. So in this patch we forbid read or write to any tag that begins with the "system:" prefix, except one: "system:write_isolation". That tag is deliberately intended to be writable by the user, as a configuration mechanism, and is never created internally by Scylla. We should have perhaps chosen a different prefix for configurable vs. internal tags, or chosen more unique prefixes - but let's not change these historic names now. This patch also adds regression tests for the internal tags features, failing before this patch and passing after: 1. internal tags, specifically system:ttl_attribute, are not visible in ListTagsOfResource, and cannot be modified by TagResource or UntagResource. 2. system:write_isolation is not internal, and can be written by either TagResource or UntagResource, and read with ListTagsOfResource. This patch also fixes a bug in the test where we added more checks for system:write_isolation - test_tag_resource_write_isolation_values. This test forgot to remove the system:write_isolation tags from test_table when it ended, which would cause tests that run later to run with a non-default write isolation - something which we never intended. Fixes #24098. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24299	2025-06-03 20:40:50 +03:00
Pavel Emelyanov	37e6ff1a3c	Merge 'test.py: cql: run tests using bare pytest command' from Evgeniy Naydanov Create a custom pytest test collector for .cql files and move CQL test execution logic from `CQLApprovalTest` class and `pylib/cql_repl/cql_repl.py` file to `CqlTest.runtest()` method. In result, the only difference between CQLApproval and Python suite types is suffixes of test files. Also there is a separate commit to remove dead code: There is `write_junit_failure_report()` method in Test class which was used to generate a JUnitXML report. But it became a dead code after removal of `write_junit_report()` function in `1e1d213592` to avoid duplication of error reporting in Jenkins (see https://github.com/scylladb/scylladb/issues/23220.) This commit removes this method and all its implementations in subclasses. Closes scylladb/scylladb#24301 * github.com:scylladb/scylladb: test.py: cql: don't exit from pytest session on failed CQL test.py: cql: run tests using bare pytest command test.py: python: set test.id according to --run_id argument test.py: python: pass --tmpdir from test.py to all Python tests test.py: remove dead code after removing of write_junit_report()	2025-06-03 19:32:06 +03:00
Pavel Emelyanov	24f430c6d2	Merge 'test.py: dtest: port next_gating tests from auth_roles_test.py' from Evgeniy Naydanov Copy `auth_roles_test.py` from scylla-dtest test suite, remove all non-next_gating tests from it, and make it work with `test.py`. As a part of the porting process, copy missing utility functions from scylla-dtest, remove unused imports and markers. Enable the test in `suite.yaml` (run in dev mode only.) Closes scylladb/scylladb#24343 * github.com:scylladb/scylladb: test.py: dtest: make auth_roles_test.py run using test.py test.py: dtest: add wait_for_any_log() to tools/log_utils.py test.py: dtest: add part of tools/assertions.py test.py: dtest: pickup latest code for retrying.py from dtest test.py: dtest: copy unmodified auth_roles_test.py	2025-06-03 18:54:47 +03:00
Patryk Jędrzejczak	8756c233e0	test: test_raft_recovery_user_data: disable hinted handoff The test is currently flaky, writes can fail with "Too many in flight hints: 10485936". See scylladb/scylladb#23565 for more details. We suspect that scylladb/scylladb#23565 is caused by an infrastructure issue - slow disks on some machines we run CI jobs on. Since the test fails often and investigation doesn't seem to be easy, we first deflake the test in this patch by disabling hinted handoff. For replacing nodes, we provide `cfg` because there should have been `cfg` in the first place. The test was correct anyway because: - `tablets_mode_for_new_keyspaces` is set to `true` by default in test/cluster/suite.yaml, - `endpoint_snitch` is set to `GossipingPropertyFileSnitch` by default if the property file is provided in `ScyllaServer.__init__`. Ref scylladb/scylladb#23565 We should backport this patch to 2025.2 because this test is also flaky on CI jobs using 2025.2. Older branches don't have this test. Closes scylladb/scylladb#24364	2025-06-03 17:48:42 +02:00
Nadav Har'El	ac70e34de9	test/alternator: verify that DeleteItem returns an empty object A user on StackOverflow (https://stackoverflow.com/questions/79650278) reported that DeleteItem returns the appropriate response (an empty object) on DynamoDB, but doesn't on "DynamoDB Local" (Amazon's local mock of DynamoDB). I wrote the test in this patch to make sure that Alternator doesn't have this bug, and indeed it doesn't: When DeleteItem is used without any option that asks for additional output, its response is, as expected, an empty object. As usual, the new test passes on both Alternator and AWS DynamoDB. (I didn't actually test on DynamoDB Local, I have some problems with running that, but it doesn't matter, we have no intention of testing DynamoDB Local). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24359	2025-06-03 18:47:34 +03:00
Avi Kivity	744015cf26	test.py: allow cmake configuration and ./configure.py configuration to coexist Cmake emits its build.ninja into build/, while configure.py emits build.ninja into ./. test.py uses this difference to choose the directory structure to test. The problem is that vscode will randomly call cmake to understand the directory structure, so we end up with both build.ninja set up. Invert the logic to look for ./build.ninja to determine the mode (instead of build/build.ninja which can exist even if the user uses traditional configuration). It can still happen that a stray ./build.ninja exists (for example due to switching branches), but that is rarer than having vscode auto-create it. Closes scylladb/scylladb#24269	2025-06-03 16:46:41 +03:00
Piotr Dulikowski	f6669422e1	Merge 'test.py: refactor test facades for better error handling' from Andrei Chekun Switching to f-string formatting to simplify the code and to unify it with a general approach for formatting strings. If the log file is absent or empty, the test fails with an error about a missing boost log file; however, that is not helpful since it is not the root cause of the failure. Adding logic to log this issue as a warning in pytest's log file and continue with providing results to pytest itself. Closes scylladb/scylladb#24307 * github.com:scylladb/scylladb: test.py: enhance boost_facade missing log file handling test.py: switch using f-string instead format in facades	2025-06-03 14:03:07 +02:00
Pavel Emelyanov	96029c7c93	Update seastar submodule * seastar d7ff58f2...26badcb1 (22): > http/client: Skip HEAD reply body processing > httpd: Remove unused connection::_req member > httpd: Don't write body for HEAD replies > http: Move trailing chunk write into reply.cc > http_client: Add ECONNRESET to retryable errors > stall_detector: no backtrace if exception > http: Add test for "aborted" client > http: in the client, fix malforming of requests with zero-sized bodies > http: Track bytes read from a response > http: Add test for improper client handling of aborted requests > aio_storage_context: Rename iocb_pool::_iocb_pool to _all_iocbs > resource: Add some debug-level logging to memory allocation > resource: Rework sysconf memory fallback > resource: Indentation fix after previous patch > resource: Calculate available memory from NUMA nodes > resource: Move NUMA nodes vector evaluation up > reactor: Drop _reuseport boolean > reactor: Simplify network stack creation and initialization > reactor: Remove write-only _thread_id > reactor: Keep task-queues in std::array instead of static_vector > reactor: Mark _id and task_queue::_id const > memory: Report oversized alloc count as metric scylla-gdb update included: The reactor::_task_queues can be std::array or unique ptrs. Also check the tq_ptr for being nullptr, as array doesn't have "size" only "capacity" and can have non-registered groups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24294	2025-06-03 13:47:05 +03:00
Nadav Har'El	e32559758a	test/alternator: any response from healthcheck means server is alive In the Alternator tests we check (in dynamodb_test_connect()) after every test that the server is still alive, so we can blame the test that just ran if it crashes the server. We check the server's health using a simple GET response, which works on both DynamoDB and Alternator, e.g., ``` $ curl http://dynamodb.us-east-2.amazonaws.com/ healthy: dynamodb.us-east-2.amazonaws.com ``` However, it turns out that new versions of DynamoDB Local - Amazon's local mock of DynamoDB, for some reason insist that all requests - including this health check - must be signed, so our unsigned health request is rejected with error 400, saying the request must be signed. So the current code, which insists that the response have error code 200, fails and the test incorrectly thinks that DynamoDB Local crashed during the test. The fix is trivial: Just don't check that the error code is 200. Any HTTP response from the server means it is still alive! If the server is not alive, we will get an exception, not any HTTP response, and this will lead the code to the "server has crashed" case. Refs #7775 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-03 12:25:51 +03:00
Nadav Har'El	9732545958	test/alternator: fall back to legal-looking access key id When the Alternator tests run against Scylla, they figure out (using CQL) the correct username and password needed to connect. When they can't, we fall back to some silly pair 'unknown_user', 'unknown_secret', assuming that the server won't check it anyway. It turns out that if we want to run tests against a new version of DynamoDB Local (Amazon's local mock of DynamoDB), it indeed doesn't check authentication, but starting in DynamoDB Local 2.0, it does check that the access key ID (the username) itself is valid, and considers "unknown_user" to be invalid because it contains an underscore - AWS_ACCESS_KEY_ID must only contain letters and numbers. See https://repost.aws/articles/ARc4hEkF9CRgOrw8kSMe6CwQ/ for Amazon's explanation for this change in DynamoDB Local 2. The trivial fix is to remove the underscore from the silly username. After this patch, Alternator tests can connect to DynamoDB Local. They still can't complete correctly - this will be fixed in the next patch. Refs #7775 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-03 12:25:51 +03:00
Evgeniy Naydanov	f0d283afd7	test.py: cql: don't exit from pytest session on failed CQL There is the fixture in `test/cql/conftest.py` which checks CQL connection after each test and exit from pytest session if the connection was failed. For CQL tests it's simply no difference what to use: pytest.exit() or pytest.fail() because tests are executing one-by-one in separate pytest sessions. Change it to pytest.fail() for future integration into a single pytest session.	2025-06-03 07:54:51 +00:00
Evgeniy Naydanov	cdc4b520da	test.py: cql: run tests using bare pytest command Create a custom pytest test collector for .cql files and move CQL test execution logic from `CQLApprovalTest` class and `pylib/cql_repl/cql_repl.py` file to `CqlTest.runtest()` method. In result, the only difference between CQLApproval and Python suite types is suffixes of test files.	2025-06-03 07:54:51 +00:00
Evgeniy Naydanov	0fba0df4f6	test.py: python: set test.id according to --run_id argument test.py uses `Test.id` attribute to distinguish repeated tests in one run and pass it as `--run_id` CLI argument to pytest. Use this argument to set the test's `id` attribute inside pytest session to fix problem with paths to some test artifacts.	2025-06-03 07:54:51 +00:00
Michał Chojnowski	ea4d251ad2	compress: fix a use-after-free in `dictionary_holder::get_recommended_dict()` The function calls copy() on a foreign_ptr (stored in a map) which can be destroyed (erased from the map) before the copy() completes. This is illegal. One way to fix this would be to apply an rwlock to the map. Another way is to wrap the `foreign_ptr` in a `lw_shared_ptr` and extend its lifetime over the `copy()` call. This patch does the latter. Fixes scylladb/scylladb#24165 Fixes scylladb/scylladb#24174 Closes scylladb/scylladb#24175	2025-06-03 10:42:38 +03:00
Piotr Dulikowski	f5b18d275b	Merge 'test/boost: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek This PR adjusts existing Boost tests so they respect the invariant introduced by enabling `rf_rack_valid_keyspaces` configuration option. We disable it explicitly in more problematic tests. After that, we enable the option by default in the whole test suite. Fixes scylladb/scylladb#23958 Backport: backporting to 2025.1 and 2025.2 to be able to test the implementation there too. Closes scylladb/scylladb#23802 * github.com:scylladb/scylladb: test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity	2025-06-03 08:43:34 +02:00
Evgeniy Naydanov	ac972231fa	test.py: python: pass --tmpdir from test.py to all Python tests `--tmpdir` CLI argument is used to point to the directory with logs and other test artifacts. It has default values both in test.py and pytest (`test/conftest.py`). These values are the same. But for non-default values it's required to pass it from test.py to pytest explicitly. This is done for Topology tests, but not for all Python test suites. The commit fixes the problem by adding the argument in `_prepare_pytest_command()` method of the base `PythonTest` class.	2025-06-03 05:45:05 +00:00
Evgeniy Naydanov	17401aaf31	test.py: remove dead code after removing of write_junit_report() There is `write_junit_failure_report()` method in Test class which was used to generate a JUnitXML report. But it became a dead code after removal of `write_junit_report()` function in `1e1d213592` to avoid duplication of error reporting in Jenkins (see #23220.) This commit removes this method and all its implementations in subclasses.	2025-06-03 02:28:41 +00:00
Pavel Emelyanov	eb5160cb4d	api: Drop no longer used container_to_vec helper All callers are patched to use std::ranges. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:09:58 +03:00
Pavel Emelyanov	f6afc02951	api: Use std::ranges to stringify collections There are several endpoints that have collection of objects at hand and want a vector of corresponding strings. Use std::ranges library for conversion. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:09:56 +03:00
Pavel Emelyanov	b943902ff7	api: Use std::ranges to convert std::set<sstring> to std::vector<string> The column_family/get_sstables_for_key endpoint collects a set of sstable names and converts it to vector of strings using homebrew helper. The std::ranges convertor works just as nice. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:09:28 +03:00
Pavel Emelyanov	6809ab5198	api: Use db::config::data_file_directories()' vector directly The return value is std::vector<sstring>, there's no need to additionally convert it to std::vector<sstring>. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:06:33 +03:00
Pavel Emelyanov	06ee60c238	api: Coroutinize get_live_endpoint() To be symmetrical with its get_down_endpoint() peer and to make further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 19:52:55 +03:00
Michał Chojnowski	dd878505ca	test: add test_sstable_compression_dictionaries_upgrade.py	2025-06-02 15:49:29 +02:00
Michał Chojnowski	d3cb873532	test.py: add --run-internet-dependent-tests Later, we will add upgrade tests, which need to download the previous release of Scylla from the internet. Internet access is a major dependency, so we want to make those tests opt-in for now.	2025-06-02 15:49:29 +02:00
Michał Chojnowski	5da19ff6a6	pylib/manager_client: add server_switch_executable Add an util for switching the Scylla executable during the test. Will be used for upgrade tests.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	1ff7e09edc	test/pylib: in add_server, give a way to specify the executable and version-specific config This will be used for upgrade tests. The cluster will be started with an older executable and without configs specific to newer versions.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	2ef0db0a6b	pylib: pass scylla_env environment variables to the topology suite I want to add an upgrade test under the topology suite. To work, it will have to know the path to the tested Scylla executable, so that it can switch the nodes to it. The path could be passed by various means and I'm not sure which method is appropriate. In some other places (e.g. the cql suite) we pass the path via the `SCYLLA` environment variable and this patch follows that example. `PythonTestSuite` (parent class of `TopologySuite`) already has that variable set in `self.scylla_env`, and passes it around. However, `TopologySuite` uses its own `run()`, and so it implicitly overrides the decision to pass `self.scylla_env` down. This patch changes that, and after the patch we apply the `self.scylla_env` to the environment for topology tests. This might have some unforeseen side effects for coverage measurement, because AFAICS the (only) other variable in `self.scylla_env` is `LLVM_PROFILE_FILE`. But topology tests don't run Scylla executables themselves (they only send commands to the cluster manager started externally), so I figure there should be no change.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	34098fbd1f	test/pylib: add get_scylla_2025_1_executable() Adds a function which downloads and installs (in `~/.cache`) the Scylla 2025.1, for upgrade tests. Note: this introduces an internet dependency into pylib, AFAIK the first one. We already have some other code for downloading existing Scylla releases, written for different purposes, in `cqlpy/fetch_scylla.py`. I made zero effort to reuse that in any way. Note: hardcoding the package version might be uncool, but if we want "better" version selection (e.g. the newest patch version in the given branch), we should have a separate library (or web service) for that, and share it with CCM/SCT. If we add a separate automatic version selection mechanism here, we are going to end up with yet another half-broken Scylla version selector, with yet different syntax and semantics than the other ones. We never clear the downloaded and unpacked files. This could become a problem in the future. (At which point we can add some mechanism that deletes cached archives downloaded more than a week ago.)	2025-06-02 15:03:08 +02:00
Michał Chojnowski	cc7432888e	pylib/scylla_cluster: give a way to pass executable-specific options to nodes I'm trying to adapt pylib to multi-version tests. (Where the Scylla cluster is upgraded to a newer Scylla version during the test). Before this patch, the initial config (where "config" == yaml file + CLI args) of the nodes is hardcoded in scylla_cluster.py. The problem is that this config might not apply to past versions, so we need some way to give them a different config. (For example, with the config as it is before the patch, a Scylla 2025.1 executable would not boot up because it does not know the `group0_voter_handler` logger). In this patch, we create a way to attach version-specific config to the executable passed to ScyllaServer.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	63218bb094	dbuild: mount "$XDG_CACHE_HOME/scylladb" We will use it to keep a cache of artifact downloads for upgrade tests, across dbuild invocations.	2025-06-02 15:03:08 +02:00
Andrei Chekun	738cbc07b5	test.py: enhance boost_facade missing log file handling If the log file is absent or empty, the test fails with an error about a missing boost log file; however, that is not helpful since it is not the root cause of the failure. Adding logic to log this issue as a warning in pytest's log file and continue with providing results to pytest itself.	2025-06-02 12:17:10 +02:00
Andrei Chekun	5f6740c1fa	test.py: switch using f-string instead format in facades Switching to f-string formatting to simplify the code and to unify it with a general approach for formatting strings.	2025-06-02 12:16:47 +02:00
Pavel Emelyanov	7fef2c4f61	Merge 'test.py: fix metrics gathering' from Andrei Chekun Move of the run_process done in https://github.com/scylladb/scylladb/pull/24091 was not fully correct. The method run_process was not overridden in the class ResourceGatherOn, so no metrics are collected at all. Additionally, fix metrics DB location second time. Closes scylladb/scylladb#24306 * github.com:scylladb/scylladb: test.py: fix metrics DB location test.py: fix the possibility to gather resource metrics for test	2025-06-02 13:12:42 +03:00
Botond Dénes	e82b0dff3e	Merge 'Move mutation_fragment_v2::kind into mutation_fragment_v2::data, mutation_fragment::kind into mutation_fragment::data' from Radosław Cybulski Move mutation_fragment_v2::kind field into mutation_fragment_v2::data. Move mutation_fragment::kind field into mutation_fragment::data. In both cases the move reduces size of the object by half (to 8 bytes). On top of the testsuite, this patch was tested manually. First, patched scylla was run. A keyspace and a table were created, with columns TEXT, INT, DOUBLE, BOOLEAN and TIMESTAMP. One row was inserted, `select ` was executed to make sure it's there. Then scylla was terminated and non-patched scylla was run, another row was inserted and `select ` was run to verify both rows exist. After this, patched scylla was started again, a third row was inserted and a final `select ` was done to verify all three rows are there. This is a partial fix to https://github.com/scylladb/scylla-enterprise/issues/5288 issue. Closes scylladb/scylladb#23452 github.com:scylladb/scylladb: Move mutation_fragment::kind into data object Make mutation_fragment::kind enum 1 byte size Move mutation_fragment_v2::kind into data object Make mutation_fragment_v2::kind enum 1 byte size	2025-06-02 10:57:17 +03:00
Evgeniy Naydanov	e780164a67	test.py: dtest: make auth_roles_test.py run using test.py As a part of the porting process, remove unused imports and markers, remove non-next_gating tests, and code for old ScyllaDB versions. Enable the test in suite.yaml (run in dev mode only)	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	145c2fed97	test.py: dtest: add wait_for_any_log() to tools/log_utils.py Copy wait_for_any_log() function from dtest tools/log_utils.py with few modifications: - Add type hints; - Change timeout for node.watch_log_for() calls from 0 to 0.1 because dtest shim's implementation uses asyncio.timeout() and 0 means not "one time" but "never run"; - Use set() instead of list() for `ret` variable; - Remove redundant `found` variable. - Remove `remaining` variable and use shallow copies to make the code more correct. As a side effect this makes the TimeoutError message more correct too; - Use f-string formatting for TimeoutError message;	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	ff2aea7e5b	test.py: dtest: add part of tools/assertions.py Copy few assertion functions from dtest tools/assertions.py: - assertion_exception() - assertion_invalid() - assertion_one() - assertion_all()	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	9d70b6307b	test.py: dtest: pickup latest code for retrying.py from dtest Sync retrying.py with dtest.	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	40464faef3	test.py: dtest: copy unmodified auth_roles_test.py The test is disabled in suite.yaml	2025-06-02 05:14:41 +00:00
Jenkins Promoter	7d562c24b1	Update pgo profiles - aarch64	2025-06-01 04:45:06 +03:00
Jenkins Promoter	75cf16afa2	Update pgo profiles - x86_64	2025-06-01 04:31:56 +03:00
Botond Dénes	c52aec3d2f	Merge 'tablets: fix missing data after tablet merge ' from Raphael Raph Carvalho Consider the following scenario: 1) let's assume tablet 0 has range [1, 5] (pre merge) 2) tablet merge happens, tablet 0 has now range [1, 10] 3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5] 4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time 5) replica service is asked to consume range [1, 10] of tablet 0 (post merge) We have two possible outcomes: With cache bypass: 1) cache reader is bypassed 2) sstable reader is created on range [1, 10] 3) unrefreshed tablet_sstable_set holds stale state, but selects correctly all sstables intersecting with range [1, 10] With cache: 1) cache reader is created 2) finds partition with token 5 is cached 3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0) 4) incremental selector consumes the pre-merge sstable spanning range [1, 5] 4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached 4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed. So with the set refreshed, the sstable set is aligned with tablet ranges, and no premature EOS is signalled; a premature EOS would otherwise prevent the fast forward from happening and all data from being properly captured in the read. This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets. Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution. Fixes: https://github.com/scylladb/scylladb/issues/23313 This change needs to be backported to all supported versions which implement tablet merge. Closes scylladb/scylladb#24287 * github.com:scylladb/scylladb: replica: Fix range reads spanning sibling tablets test: add reproducer and test for mutation source refresh after merge tablets: trigger mutation source refresh on tablet count change	2025-05-30 15:37:29 +03:00
Anna Stuchlik	28cb5a1e02	doc: add OS support for ScyllaDB 2025.2 This commit adds the information about support for platforms in ScyllaDB version 2025.2. Fixes https://github.com/scylladb/scylladb/issues/24180 Closes scylladb/scylladb#24263	2025-05-30 12:23:59 +03:00
Calle Wilund	942477ecd9	encryption/utils: Move encryption httpclient to "general" REST client Fixed #24296 While the HTTP client used for REST calls in AWS/GCP KMS integration (EAR) is not general enough to be called a HTTP client as such, it is general enough to be called a REST client (limited to stateless, single-op REST calls). Other code, like general auth integrations (hello Azure) and similar could reuse this to lessen code duplication. This patch simply moves the httpclient class from encryption to "rest" namespace, and explicitly "limits" it to such usage. Making an alias in encryption to avoid touching more files than needed. Closes scylladb/scylladb#24297	2025-05-30 12:21:51 +03:00
Pavel Emelyanov	a65ffdd0df	test/result_utils: Do not assume map_reduce reducing order When map_reduce is called on a collection, one shouldn't expect that it processes the elements of the collection in any specific order. The current test of map-reduce over boost outcome assumes that if the reduce function is string concatenation, then it would concatenate the given vector of strings in the order they are listed. That requirement should be relaxed, and the result may have reversed concatenation. Fixes scylladb/scylladb#24321 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24325	2025-05-30 09:38:59 +02:00
Michael Litvak	3a1be33143	test_cdc_generation_publishing: fix to read monotonically The test test_multiple_unpublished_cdc_generations reads the CDC generation timestamps to verify they are published in the correct order. To do so it issues reads in a loop with a short sleep period and checks the differences between consecutive reads, assuming they are monotonic. However the assumption that the reads are monotonic is not valid, because the reads are issued with consistency_level=ONE, thus we may read timestamps {A,B} from some node, then read timestamps {A} from another node that didn't apply the write of the new timestamp B yet. This will trigger the assert in the test and fail. To ensure the reads are monotonic we change the test to use consistency level ALL for the reads. Fixes scylladb/scylladb#24262 Closes scylladb/scylladb#24272	2025-05-30 08:35:56 +02:00
Pavel Emelyanov	086777e5de	Merge 'test.py: python: run tests using bare pytest command' from Evgeniy Naydanov Main change is splitting logic of `PythonTest.run()` method into `PythonTest.run_ctx()` context manager and `PythonTest.run()` method itself and add the `host` fixture which uses `PythonTest.run_ctx()` context manager to setup and teardown ScyllaDB node if `--test-py-init` argument is used. Otherwise, this fixture returns a value of `--host` CLI argument. Use dynamic scope provided by `testpy_test_fixture_scope()` function instead of `session` to maintain compatibility with `test.py` and `./run` scripts. Other related changes: * Add utility `get_testpy_test()` function to `pylib.suite.base` which combines all required steps to create an instance of `Test` class and rework `testpy_test` fixture to use it. * Switch to use dynamic fixture scope controlled by `--test-py-init` CLI argument to improve compatibility with test.py. And because in test.py mode the scope is `session`, also change default event loop scope to `session`. * Convert `get_valid_alternator_role()` to fixture to have more control on the scope of the cache used. Additionally, function `new_dynamodb_session()` was also converted to a fixture, because it uses `get_valid_alternator_role()`. * Replace dups of `cql` and `this_dc` fixtures in `rest_api` and `pylib/cql_repl` with imports from `cqlpy`. * Change `build_mode` fixture to return "unknown" if no --mode arguments provided (this is mainly for alternator and cqlpy tests) * Create a parent directory for a test log file just before opening this file in `run_test()` function instead of having this as a side effect in `Test.__init__()`. 
And changes that remove pytest CLI argument duplicates to be able to run tests from different test suites in one pytest session: * Add 3 supplementary functions to `test.pylib.suite.python`: `add_host_option()` (which adds `--host` options to pytest session), `add_cql_connection_options()` (which adds `--port`, and `--ssl`), and `--add-s3-options` (which adds options related to S3 connection.) Each function decorated with `@cache` decorator to be executed once per pytest session and avoid CLI options duplication for runs which executes `alternator`, `cqlpy`, `rest_api`, or `broadcast_tables` in one pytest session. * Move `--auth_username` and `--auth_password` options from `cluster/conftest.py` to add_scylla_cql_connection_options() and slightly rework `cql` fixture to support these options. * Remove `--input`, `--output`, and `--keep-tmp` pytest CLI options from `cluster/object_store/conftest.py` because they are not used in this suite. * Remove `--omit-scylla-output` CLI option from pytest argparser. Instead, remove it from `sys.argv` in `cqlpy/run.py`. Also, no need to check this option in `alternator/run`. Closes scylladb/scylladb#23849 * github.com:scylladb/scylladb: test.py: python: run tests using bare pytest command test.py: rework testpy_test fixture test.py: alternator: convert get_valid_alternator_role() to fixture test.py: python: split logic of PythonTest.run() test.py: add credentials options to add_cql_connection_options() test.py: python: remove dups of cql and this_dc fixtures test.py: remove duplication of pytest CLI options test.py: remove unused CLI options test.py: remove `--omit-scylla-output` from pytest argparser test.py: set build_mode to "unknown" if no --mode argument test.py: create directory for test log in run_test()	2025-05-30 08:48:43 +03:00
Botond Dénes	7db956965e	mutation/mutation_compactor: cache regular/shadowable max-purgable in separate members Max purgeable has two possible values for each partition: one for regular tombstones and one for shadowable ones. Yet currently a single member is used to cache the max-purgeable value for the partition, so whichever kind of tombstone is checked first, its max-purgeable will become sticky and apply to the other kind of tombstones too. E.g. if the first can_gc() check is for a regular tombstone, its max-purgeable will apply to shadowable tombstones in the partition too, meaning they might not be purged, even though they are purgeable, as the shadowable max-purgeable is expected to be more lenient. The other way around is worse, as it will result in regular tombstone being incorrectly purged, permitted by the more lenient shadowable tombstone max-purgeable. Fix this by caching the two possible values in two separate members. A reproducer unit test is also added. Fixes: scylladb/scylladb#23272 Closes scylladb/scylladb#24171	2025-05-29 22:52:08 +03:00
Avi Kivity	f0ec9dd8f2	Merge 'utils/logalloc: enforce the max contiguous allocation size limit' from Michał Chojnowski This series fixes the only known violation of logalloc's allocation size limits (in `chunked_managed_vector`), and then it makes those limits hard. Before the series, LSA handles overly-large allocations by forwarding them to the standard allocator. After the series, an attempt to do an overly large allocation via LSA will trigger an `on_internal_error` instead. We do this because the allocator fallback logic turned out to have subtle and problematic accounting bugs. We could fix them, or we can remove the mechanism altogether. It's hard to say which choice is better. This PR arbitrarily makes the choice to remove the mechanism. This makes the logic simpler, at the risk of escalating some allocation size bugs to crashes. See the descriptions of individual commits for more details. Fixes scylladb/scylladb#23850 Fixes scylladb/scylladb#23851 Fixes scylladb/scylladb#23854 I'm not sure if any of this should be backported or not. The `chunked_managed_vector` fix could be backported, because it's a bugfix. It's an old bug, though, and we have never observed problems related to it. The changes to `logalloc` aren't supposed to be fixing any observable problem, so a backport probably has more risk than benefit in this case. Closes scylladb/scylladb#23944 * github.com:scylladb/scylladb: utils/logalloc: enforce LSA allocation size limits utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()	2025-05-29 22:11:41 +03:00
Szymon Malewski	18d237a393	alternator/executor: Added checks in `batch_write_item` This patch adds checks validating 'BatchWriteItem' requests mostly to avoid ugly fallback message. It changes request's behaviour in case of an empty array of WriteRequests - previously such an array was ignored and whole request might succeed, now it raises ValidationException, following the documentation and behaviour of DynamoDB. Patch includes tests in test_manual_requests (`test_batch_write_item_invalid_payload`, `test_batch_write_item_empty_request_list`) testing with several offending cases. Fixes #23233 Closes scylladb/scylladb#23878	2025-05-29 20:33:57 +03:00
Patryk Jędrzejczak	c21692f3a6	Merge 'token_range_vector: fragment' from Avi Kivity token_range_vector is a sequence of intervals of tokens. It is used to describe vnodes or token ranges owned by shards. Since tokens are bloated (16 bytes instead of 8), and intervals are bloated (40 bytes of overhead instead of 8), and since we have plenty of token ranges, such vectors can exceed our allocation unit of 128 kB and cause allocation stalls. This series fixes that by first generalizing some helpers and then changing token_range_vector to use chunked_vector. Although this touches IDL, there is no compatibility problem since the encodings for vector and chunked_vector are identical. There is no performance concern since token_range_vector is never used on any hot path (hot paths always contain a partition key). Fixes #3335. Fixes #24115. No backport: minor performance fix that isn't a regression. Closes scylladb/scylladb#24205 * https://github.com/scylladb/scylladb: dht: fragment token_range_vector partition_range_compat: generalize wrap/unwrap helpers	2025-05-29 18:45:13 +02:00
Robert Bindar	c570941692	Add nodetool refresh --scope option This change adds the --scope option to nodetool refresh. Like in the case of nodetool restore, you can pass either of: * node - On the local node. * rack - On the local rack. * dc - In the datacenter (DC) where the local node lives. * all (default) - Everywhere across the cluster. as scope. The feature is based on the existing load_and_stream paths, so it requires passing --load-and-stream to the refresh command. Also, it is not compatible with the --primary-replica-only option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23861	2025-05-29 16:12:09 +03:00
Evgeniy Naydanov	0ee0e3f14d	test.py: python: run tests using bare pytest command Add the `host` fixture which uses `PythonTest.run_ctx()` context manager to setup and teardown ScyllaDB node if `--test-py-init` argument is used. Otherwise, this fixture returns a value of `--host` CLI argument. Use dynamic scope provided by `testpy_test_fixture_scope()` function instead of `session` to maintain compatibility with test.py and ./run scripts.	2025-05-29 12:33:41 +00:00
Evgeniy Naydanov	b67048f3ee	test.py: rework testpy_test fixture Add utility `get_testpy_test()` function to `pylib.suite.base` which combines all required steps to create an instance of `Test` class. Remove redundant `testpy_testsuite` fixture. Switch to use dynamic fixture scope controlled by `--test-py-init` CLI argument to improve compatibility with test.py. And because in test.py mode the scope is `session`, also change default event loop scope to `session`. The fixture is None for test.py mode. test.py runs tests file-by-file as separate pytest sessions, so, `session` scope is effectively close to be the same as `module` (can be a difference in the order.) In case of running tests with bare pytest command, we need to use `module` scope to maintain same behavior as test.py, since we run all tests in one pytest session.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	b65cb517b8	test.py: alternator: convert get_valid_alternator_role() to fixture Convert `get_valid_alternator_role()` to fixture to have more control on the scope of the cache used. Additionally, function `new_dynamodb_session()` was also converted to a fixture, because it uses `get_valid_alternator_role()`.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	1f94a9c052	test.py: python: split logic of PythonTest.run() Split logic of `PythonTest.run()` method into `PythonTest.run_ctx()` context manager and `PythonTest.run()` method itself. Done this to reuse setup/teardown code with bare pytest command runs.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	27cbfc77fb	test.py: add credentials options to add_cql_connection_options() Move `--auth_username` and `--auth_password` options from `cluster/conftest.py` to add_cql_connection_options() and slightly rework `cql` fixture to support these options.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	2bba4acdea	test.py: python: remove dups of cql and this_dc fixtures Replace dups of `cql` and `this_dc` fixtures in `rest_api` and `pylib/cql_repl` with imports from `cqlpy`.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	6780461df8	test.py: remove duplication of pytest CLI options Add 3 supplementary functions to `test.pylib.suite.python`: `add_host_option()` (which adds `--host` options to pytest session), `add_cql_connection_options()` (which adds `--port`, and `--ssl`), and `--add-s3-options` (which adds options related to S3 connection.) Each function decorated with `@cache` decorator to be executed once per pytest session and avoid CLI options duplication for runs which executes `alternator`, `cqlpy`, `rest_api`, or `broadcast_tables` in one pytest session.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	056c5db829	test.py: remove unused CLI options Remove `--input`, `--output`, and `--keep-tmp` pytest CLI options from `cluster/object_store/conftest.py` because they are not used in this suite.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	b7b68355ef	test.py: remove `--omit-scylla-output` from pytest argparser Remove `--omit-scylla-output` CLI option from pytest argparser. Instead, remove it from `sys.argv` in `cqlpy/run.py`. Also, no need to check this option in `alternator/run`.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	f262d4c323	test.py: set build_mode to "unknown" if no --mode argument Change `build_mode` fixture to return "unknown" if no --mode arguments provided (this is mainly for alternator and cqlpy tests)	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	30d542b8f1	test.py: create directory for test log in run_test() Create a parent directory for a test log file just before opening this file in `run_test()` function instead of having this as a side effect in `Test.__init__()`.	2025-05-29 12:15:28 +00:00
Piotr Dulikowski	c8d52a4318	Merge 'test.py: dtest: port bypass_cache_test.py' from Evgeniy Naydanov Copy bypass_cache_test.py from the scylla-dtest test suite and make it work with test.py. As a part of the porting process, copy missing utility functions from scylla-dtest, remove unused imports and markers, and add the missing `single_node` marker description to pytest.ini. Enable the test in suite.yaml (run in dev mode only.) Also add the missing `ScyllaCluster.nodetool()` method in dtest shim code. Closes scylladb/scylladb#24230 * github.com:scylladb/scylladb: test.py: dtest: make bypass_cache_test.py run using test.py test.py: dtest: add missed ScyllaCluster.nodetool() test.py: dtest: copy unmodified bypass_cache_test.py	2025-05-29 13:48:10 +02:00
Michał Chojnowski	cb02d47b10	utils/logalloc: enforce LSA allocation size limits In order to guarantee a decent upper limit on fragmentation, LSA only handles allocations smaller than 0.1 of a segment. Allocations larger than this limit are permitted, but they are not placed in LSA segments. Instead, they are forwarded to the standard allocator. We don't really have any use case for this "fallback". As far as I can tell, it only exists for "historical" reasons, from times where there were some data structures which weren't fully adapted to LSA yet. We don't want the fallback to be used. Long-lived standard allocations are undesirable. They have higher internal fragmentation than LSA allocations, and they can cause external fragmentation in the standard allocator. So we want to eliminate them all. The only reason to keep the fallback is to soften the impact if some bug results in limit-exceeding LSA allocations happening in production. In principle, the fallback turns a crash (or something similarly drastic) into just a performance problem. However, it turns out that the fallback is buggy. Recently we had a bug which caused limit-exceeding LSA allocations to happen. And then it turned out that LSA reclaim doesn't deal fully correctly with evictable non-LSA allocations, and the dirty_memory_manager accounting for non-LSA allocations is completely wrong. This resulted in subtle, serious, and hard to understand stability problems in production. Arguably the biggest problem is that the "fallback" allocations weren't reported in any way. They were happening in some tests, but they were silently permitted, so nobody noticed that they should be eliminated. If we just had a rate-limited error log that reports fallback allocations, they would have never got into a release. So maybe we could fix the fallback, add more tests for it, add a warning for when it's used, and keep it. But this PR instead opts for removing the fallback mechanism altogether and failing fast. 
After the patch, if a non-conforming allocation happens, it will trigger an `on_internal_error`. With this, we risk a greater impact if some non-conforming allocations happen in production, but we make the system simpler. It's hard to say if it's a good tradeoff.	2025-05-29 13:05:08 +02:00
Piotr Dulikowski	555925c66b	Merge 'generic_server: transport: improve stats counting and shedding' from Marcin Maliszkiewicz The patch removes connection advertising functions and moves the logic to constructors and destructors, providing a more robust way of counting connections. This change was also necessary to allow skipping the connection process function during shedding, as the active connections counter needs to be decremented. The patch doesn't fix any active bug, just improves the flow. Backport: none, it's a cosmetic change Closes scylladb/scylladb#23890 * github.com:scylladb/scylladb: generic_server: make shutdown() return void generic_server: skip connection processing logic after shedding the connection transport: generic_server: remove no longer used connection advertising code transport: move new connection trace logs into connection class ctor/dtor transport: move cql connections counting into connection class ctor/dtor	2025-05-29 12:49:58 +02:00
Avi Kivity	c00824c7df	Merge 'transport: Implement SCYLLA_USE_METADATA_ID support' from Andrzej Jackowski Metadata id was introduced in CQLv5 to keep prepared statement metadata consistent between driver and database. This commit introduces a protocol extension that allows using the same mechanism in CQLv4. As CQLv5 is currently unsupported in ScyllaDB (as well as in some of the drivers), the motivation is to allow fixing https://github.com/scylladb/scylladb/issues/20860. This change: - Implement metadata::calculate_metadata_id() - Implement SCYLLA_USE_METADATA_ID protocol extension for CQLv4 - Added description of SCYLLA_USE_METADATA_ID in documentation - Add boost tests to confirm correctness of the function - Add python tests for table metadata change corner-cases Fixes scylladb/scylladb#20860 Also see related https://scylladb.atlassian.net/wiki/spaces/RND/pages/42238631/MetadataId+extension+in+CQLv4+Requirement+Document No backport needed (unless specifically requested by a customer), because there are existing workarounds for the issue Closes scylladb/scylladb#23292 * github.com:scylladb/scylladb: test: add tests for prepared statement metadata consistency corner cases transport: implement SCYLLA_USE_METADATA_ID support cql3: implement metadata::calculate_metadata_id()	2025-05-29 12:27:31 +03:00
Andrei Chekun	0c5676ffb4	test.py: fix metrics DB location This was already fixed, but unintentionally during rebases it was reverted and merged to master in the same PR.	2025-05-28 20:13:38 +02:00
Andrei Chekun	6e92791538	test.py: fix the possibility to gather resource metrics for test Move of the run_process done in #24091 was not fully correct. The method run_process was not overridden in the class ResourceGatherOn, so no metrics are collected at all.	2025-05-28 20:13:31 +02:00
Ran Regev	37854acc92	changed the string literals into the correct ones Fixes: #23970 use correct string literals: KMIP_TAG_CRYPTOGRAPHIC_LENGTH_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_LENGTH KMIP_TAG_CRYPTOGRAPHIC_USAGE_MASK_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_USAGE_MASK From https://github.com/scylladb/scylladb/issues/23970 description of the problem (emphasis mine): When transparent data encryption at rest is enabled with KMIP as a key provider, the observation is that before creating a new key, Scylla tries to locate an existing key with the provided specifications (key algorithm & length), with the intention of re-using the existing key, but the attributes sent in the request have minor spelling mistakes which are rejected by the KMIP server key provider, and hence Scylla assumes that a key with these specifications doesn't exist, and creates a new key in the KMIP server. The issue here is that for every new table, ScyllaDB will create a key in the KMIP server, which could clutter the KMS, and make key lifecycle management difficult for DBAs. Closes scylladb/scylladb#24057	2025-05-28 13:52:30 +03:00
Pavel Emelyanov	2eed2e94ea	sstables_loader: Extend logging with recently added skip-cleanup When starting, the loader prints all its arguments into logs. Recently added skip-cleanup one is not included, but it's good to have one too. refs: #24139 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24206	2025-05-28 11:20:27 +03:00
David Garcia	9542bfd2b1	docs: enable ai chatbot Closes scylladb/scylladb#24286	2025-05-28 11:04:25 +03:00
Yaron Kaikov	0831931fec	.github/workflows/conflict_reminder: reduce the amount of conflict reminder for every push event In order to avoid spamming the PR author about conflicts, added logic to verify during push events that, in case the PR is already in draft mode, we check when the last notification was sent; if it was less than 3 days ago, we skip it Closes scylladb/scylladb#24289	2025-05-28 11:01:44 +03:00
Nadav Har'El	61581d458e	Merge 'vector_index: add custom index class' from Michał Hudobski This PR adds a class that allows for validation (and in the future creating and querying) of custom indexes and implements it for vector indexes. Currently custom vector_index creation runs a usual index creation process. This PR does not change that, however it adds validation of the parameters that need to have certain values for the actual creation of the vector index in the future. The only thing left for the vector_index feature to work as intended should be the integration with the Vector Store service. This is a continuation of https://github.com/scylladb/scylladb/pull/23720 Refs: [VS-55 ](https://scylladb.atlassian.net/browse/VS-55) (Support setting index parameters and similarity function in CREATE INDEX) Fixes: [VS-13](https://scylladb.atlassian.net/browse/VS-13) (Validate that the base type is numeric when creating the vector index) [VS-13]: https://scylladb.atlassian.net/browse/VS-13?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#24212 * github.com:scylladb/scylladb: test/cqlpy: remove xfail and add more vector tests vector_index: allow options when custom class is provided vector_index: add custom index and vector index classes	2025-05-28 10:42:29 +03:00
Raphael S. Carvalho	53df911145	replica: Fix range reads spanning sibling tablets We don't guarantee that coordinators will only emit range reads that span only one tablet. Consider this scenario: 1) split is about to be finalized, barrier is executed, completes. 2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet) 3) split is committed to group0, all replicas switch storage. 4) replica-side read is executed, uses a range which spans tablets. We could fix it with two-phase split execution. Rather than pushing the complexity to higher levels, let's fix incremental selector which should be able to serve all the tokens owned by a given shard. During split execution, either of sibling tablets aren't going anywhere since it runs with state machine locked, so a single read spanning both sibling tablets works as long as the selector works across tablet boundaries. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-27 22:39:40 -03:00
Michał Hudobski	195e6a82de	test/cqlpy: remove xfail and add more vector tests We have added validation for options and type of column for vector indexes. This commit adds tests for that validation.	2025-05-27 21:04:50 +02:00
Michał Hudobski	7a2b0179e8	vector_index: allow options when custom class is provided We have changed the validation for the custom index to not require the CUSTOM keyword when creating the index, only the custom class. Now we change the validation for options so that they match.	2025-05-27 21:04:50 +02:00
Michał Hudobski	3ab643a5de	vector_index: add custom index and vector index classes In this patch we add an abstract class, "custom_index", with a validate() method. Each CUSTOM INDEX class needs to implement a concrete subclass of custom_index which is used to validate if this type of custom index class may be used, and whether the optional parameters passed to it are valid. We change the existing CUSTOM INDEX validation code to use this new mechanism. Finally this patch implements one concrete subclass for vector index. Before this patch, the custom index type "vector_index" was allowed, but after this patch it gains more validation of its optional parameters (we support 4 specific parameters, with some rules on their values). Of course, the vector index isn't actually implemented in this patch, we are just improving the validation of the index creation statement.	2025-05-27 21:04:50 +02:00
Marcin Maliszkiewicz	7f057af1f2	replica: make non-preemptive keyspace create/update/delete functions public As those operations will be managed by schema_applier class. This will be implemented in following commit.	2025-05-27 20:01:35 +02:00
Marcin Maliszkiewicz	2daa630938	replica: split update keyspace into two phases - first phase is preemptive (prepare_update_keyspace) - second phase is non-preemptive (update_keyspace) This is done so that schema change can be applied atomically. Additionally, create keyspace code was changed to share a common part with the update keyspace flow. This commit doesn't yet change the behaviour of the code, as it doesn't guarantee atomicity; that will be done in following commits.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	fe0f4033ca	replica: split creating keyspace into two functions This is done so that in following commits insert_keyspace can be used to atomically change schema (as it doesn't yield).	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	aceb1f9659	db: rename create_keyspace_from_schema_partition It only creates keyspace metadata.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	f8fe51640a	db: decouple functions and aggregates schema change notification from merging code	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	52069d954f	db: store functions and aggregates change batch in schema_applier To be used in following commit.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	5fff3097a5	db: decouple tables and views schema change notifications from merging code As post_commit() can't be fully implemented at this stage, it was moved to interim place to keep things working. It will be moved back later.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	6f8579e242	db: store tables and views schema diff in schema_applier It will be used in subsequent commit for moving notifications code.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	b74c1e9ae4	db: decouple user type schema change notifications from types merging code Merging types code now returns generic affected_types structure which is used both for notifications and dropping types. New static function drop_types() replaces dropping lambda used before. While I think it's not necessary for dropping nor notifications to use per shard copies (like it's using before and after this patch) it could just use string parameters or something similar but this requires too many changes in other classes so it's out of scope here.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	3a95edd0d7	service: unify keyspace notification functions arguments Keyspace metadata is not used, only name is needed so we can remove those extra find_keyspace() calls. Moreover there is no need to copy the name.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	d7202586ca	db: replica: decouple keyspace schema change notifications to a separate function In following commits we want to separate updating code from committing the schema change (making it visible). Since notifications should be issued after the change is visible we need to separate them and call them after committing. In subsequent commits other notification types will be moved too. We change here the order of notification calls with regard to the rest of the schema updating code. I.e. before, keyspace notifications triggered before tables were updated; after the change they will trigger once everything is updated. There is no indication that notification listeners depend on this behaviour.	2025-05-27 19:59:47 +02:00
Marcin Maliszkiewicz	ddf9f7ae05	db: add class encapsulating schema merging This commit doesn't yet change how schema merging works but it prepares the ground for it. We split merging code into several functions. Main reasons for it are that: - We want to generalize and create some interface which each subsystem would use. - We need to pull mutation's apply() out of the code because raft will call it directly, and it will contain a mix of mutations from more than one subsystem. This is needed because we have the need to update multiple subsystems atomically (e.g. auth and schema during auto-grant when creating a table). In this commit do_merge_schema() code is split between prepare(), update(), commit(), post_commit(). The idea behind each of these phases is described in the comments. The last 2 phases are not yet implemented as it requires more code changes but adding schema_applier enclosing class will help to create some copied state in the future and implement commit() and post_commit() phases.	2025-05-27 19:33:02 +02:00
Marcin Maliszkiewicz	1eb580973c	generic_server: make shutdown() return void It's always immediately ready so no need to return future<>.	2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz	d76d1766ad	generic_server: skip connection processing logic after shedding the connection Since input and output descriptors are already closed at this point there is no need to call connection::process. This should make shedding use slightly less resources.	2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz	f7e5adaca3	transport: generic_server: remove no longer used connection advertising code	2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz	81f0e79dc0	transport: move new connection trace logs into connection class ctor/dtor This is a step towards replacing advertise_new_connection/unadvertise_connection by RAII which is less error prone. Advertising will be removed in subsequent commit.	2025-05-27 19:30:56 +02:00
Marcin Maliszkiewicz	371b959539	transport: move cql connections counting into connection class ctor/dtor This is a step towards replacing advertise_new_connection/unadvertise_connection by RAII which is less error prone. Advertising will be removed in subsequent commit.	2025-05-27 19:30:39 +02:00
Dawid Mędrek	c60035cbf6	test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default We've adjusted all of the Boost tests so they respect the invariant enforced by the `rf_rack_valid_keyspaces` configuration option, or explicitly disabled the option in those that turned out to be more problematic and will require more attention. Thanks to that, we can now enable it by default in the test suite.	2025-05-27 18:53:39 +02:00
Dawid Mędrek	237638f4d3	test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the file verify more subtle parts of the behavior of tablets and rely on topology layouts or using keyspaces that violate the invariant the `rf_rack_valid_keyspaces` configuration option is trying to enforce. Because of that, we explicitly disable the option to be able to enable it by default in the rest of the test suite in the following commit.	2025-05-27 18:53:36 +02:00
Anna Stuchlik	efce03ef43	doc: clarify RF increase issues for tablets vs. vnodes This commit updates the guidelines for increasing the Replication Factor depending on whether tablets are enabled or disabled. To present it in a clear way, I've reorganized the page. Fixes https://github.com/scylladb/scylladb/issues/23667 Closes scylladb/scylladb#24221	2025-05-27 17:47:50 +02:00
Dawid Mędrek	22d6c7e702	test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load	2025-05-27 16:01:14 +02:00
Dawid Mędrek	fa62f68a57	test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity We make sure that the keyspaces created in the test are always RF-rack-valid. To achieve that, we change how the test is performed. Before this commit, we first created a cluster and then ran the actual test logic multiple times. Each of those test cases created a keyspace with a random replication factor. That cannot work with `rf_rack_valid_keyspaces` set to true. We cannot modify the property file of a node (see commit: `eb5b52f598`), so once we set up the cluster, we cannot adjust its layout to work with another replication factor. To solve that issue, we also recreate the cluster in each test case. Now we choose the replication factor at random, create a cluster distributing nodes across as many racks as RF, and perform the rest of the logic. We perform it multiple times in a loop so that the test behaves as before these changes.	2025-05-27 15:52:38 +02:00
Dawid Mędrek	cd615c3ef7	test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity We distribute the nodes used in the test across two racks so we can run the test with `rf_rack_valid_keyspaces` set to true. We want to avoid cross-rack migrations and keep the test as realistic as possible. Since host3 is supposed to function as a new node in the cluster, we change the layout of it: now, host1 has 2 shards and resides in a separate rack. Most of the remaining test logic is preserved and behaves as before this commit. There is a slight difference in the tablet migrations. Before the commit, we were migrating a tablet between nodes of different shard counts. Now it's impossible because it would force us to migrate tablets between racks. However, since the test wants to simply verify that an ongoing migration doesn't interfere with load balancing and still leads to a perfect balance, that still happens: we explicitly migrate ONLY 1 tablet from host2 to host3, so to achieve the goal, one more tablet needs to be migrated, and we test that.	2025-05-27 15:41:27 +02:00
Ferenc Szili	1f9f724441	test: add reproducer and test for mutation source refresh after merge This change adds a reproducer and test for the fix where the local mutation source is not always refreshed after a tablet merge.	2025-05-27 15:18:36 +02:00
Ferenc Szili	d0329ca370	tablets: trigger mutation source refresh on tablet count change Consider the following scenario: - let's assume tablet 0 has range [1, 5] (pre merge) - tablet merge happens, tablet 0 has now range [1, 10] - tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5] - during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time - replica service is asked to consume range [1, 10] of tablet 0 (post merge) We have two possible outcomes: With cache bypass: 1) cache reader is bypassed 2) sstable reader is created on range [1, 10] 3) unrefreshed tablet_sstable_set holds stale state, but correctly selects all sstables intersecting with range [1, 10] With cache: 1) cache reader is created 2) finds partition with token 5 is cached 3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0) 4) incremental selector consumes the pre-merge sstable spanning range [1, 5] 4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached 4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed. So with the set refreshed, the sstable set is aligned with tablet ranges and no premature EOS is signalled; a premature EOS would otherwise prevent the fast forward from happening and all data from being properly captured by the read. This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets. Fixes: #23313	2025-05-27 15:15:43 +02:00
Wojciech Mitros	5074daf1b7	test: actually wait for tablets to distribute across nodes In test_tablet_mv_replica_pairing_during_replace, after we create the tables, we want to wait for their tablets to distribute evenly across nodes and we have a wait_for for that. But we don't await this wait_for, so it's a no-op. This patch fixes it by adding the missing await. Refs scylladb/scylladb#23982 Refs scylladb/scylladb#23997 Closes scylladb/scylladb#24250	2025-05-27 15:12:25 +02:00
Avi Kivity	844a49ed6e	dht: fragment token_range_vector token_range_vector is a linear vector containing intervals of tokens. It can grow quite large in certain places and so cause stalls. Convert it to utils::chunked_vector, which prevents allocation stalls. It is not used in any hot path, as it usually describes vnodes or similar things. Fixes #3335.	2025-05-27 14:47:24 +03:00
Avi Kivity	83c2a2e169	partition_range_compat: generalize wrap/unwrap helpers These helpers convert vectors of wrapped intervals to vectors of unwrapped intervals and vice versa. Generalize them to work on any sequence type. This is in preparation of moving from vectors to chunked_vectors.	2025-05-27 14:47:21 +03:00
Botond Dénes	542b2ed0de	Merge 'Remove req_params facility from API' from Pavel Emelyanov The class was introduced to facilitate path and query parameters parsing from requests, but in fact it's mostly dead code. First, the class introduces the concept of "mandatory" parameters which are seastar path params. If missing, the parameter validation throws, but in all cases where this option is used in scylla it's impossible to get an empty path param -- if the parameter is missing seastar returns 404 (not found) before calling the handler. Second, the req_params::get<T>() doesn't work for anything but a string argument (or types such that optional<T> can be implicitly cast to optional<sstring>). And it's in fact only used to get sstrings, so it compiles and works so far. The remaining ability to parse bool from string is partially duplicated by the validate_bool() method. Using a plain method to parse string to bool is less code than req_params introduces. One (arguably) useful thing req_params does is validate that the incoming request does _not_ contain unknown query parameters. However, quite a few endpoints use this, most of them just cherry-pick parameters they want and ignore the others. There's already a comprehensive description of accepted parameters for each endpoint in api-doc/ and req_params duplicates it. Good validation code should rely on api-doc/, not on its partial copy. Having said that, this PR introduces the validate_bool_x() helper to do req_params-like parsing of strings to bools, patches existing handlers to use existing parameter parsing facilities (such as validate_keyspace() and parse_table_infos()) and drops the req_params. 
Closes scylladb/scylladb#24159 * github.com:scylladb/scylladb: api: Drop class req_params api: Stop using req_params in parse_scrub_options api: Stop using req_params in tasks::force_keyspace_compaction_async api: Stop using req_params in ss::force_keyspace_compaction api: Stop using req_params in ss::force_compaction api: Stop using req_params in cf::force_major_compaction api: Add validate_bool_x() helper	2025-05-27 14:29:05 +03:00
Ernest Zaslavsky	7d0d3ec1c8	load_and_stream: Add abortion flow to mutation streaming * The new abort command explicitly represents the abortion flow in mutation streaming, clearly identifying operations that are intentionally aborted. This reduces ambiguity around failures in streaming operations. * In the error-handling section, aborted operations are now explicitly marked as the cause of the streaming failure. This allows us to differentiate them from genuine errors and appropriately adjust log severity to reduce unnecessary alarm caused by aborted streaming failures. * To avoid alarming users with excessive error logs, log severity for streaming failures caused by aborted operations has been downgraded. This helps keep logs cleaner and prevents unnecessary concerns. * A new feature has been added to ensure mixed clusters during updates do not receive unsupported RPC messages, improving compatibility and stability. fixes: https://github.com/scylladb/scylladb/issues/23076 Closes scylladb/scylladb#23214	2025-05-27 14:21:58 +03:00
Dawid Mędrek	1199c68bac	test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity We assign the nodes created by the test to separate racks. It has no impact on the test since the keyspace used in the test uses RF=2, so the tablet replicas will still be the same.	2025-05-27 13:18:11 +02:00
Dawid Mędrek	e4e3b9c3a1	test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity We distribute the nodes used in the test between two racks. Although that may affect how tablets behave in general, this change will not have any real impact on the test. The test verifies that load balancing eventually balances tablets in the cluster, which will still happen. Because of that, the changes in this commit are safe to apply.	2025-05-27 13:18:09 +02:00
Dawid Mędrek	6e2fb79152	test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity We distribute the nodes used in the test between two racks. Although that may have an impact on how tablets behave, it's orthogonal to what the test verifies -- whether the topology coordinator is continuously in the tablet migration track. Because of that, it's safe to make this change without influencing the test.	2025-05-27 13:18:07 +02:00
Botond Dénes	485df63fd5	Merge 'Extend compaction_history table with additional compaction statistics' from Łukasz Paszkowski Currently, the `system.compaction_history` table is missing information such as the type of compaction (cleanup, major, resharding, etc), the sstable generations involved (in and out), the id of the shard the compaction was triggered on, and statistics on purged tombstones to be collected during compaction. The series extends the table with the following columns: - "compaction_type" (text) - "shard_id" (int) - "sstables_in" (list<sstableinfo_type>) - "sstables_out" (list<sstableinfo_type>) - "total_tombstone_purge_attempt" (long) - "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long) - "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long) with a user defined type `sstableinfo_type` that holds the information about an sstable file - generation (uuid) - origin (text) - size (long) Additional statistics stored in the compaction_history have been incorporated in the API `/compaction_manager/compaction_history` and the `nodetool compactionhistory` command. No backport is required. It extends the existing compaction history output. 
Fixes https://github.com/scylladb/scylladb/issues/3791 Closes scylladb/scylladb#21288 * github.com:scylladb/scylladb: nodetool: Refactor of compactionhistory_operation nodetool: Add more stats into compactionhistory output api/compaction_manager: Extend compaction_history api compaction: Collect tombstone purge stats during compaction compacting_reader: Extend to accept tombstone purge statistics mutation_compactor: Collect tombstone purge attempts compaction_garbage_collector: Extend return type of max_purgeable_fn compaction: Extend compaction_result to collect more information system_keyspace: Upgrade compaction_history table system_keyspace: Create UDT: sstableinfo_type system_keyspace: Extract compaction_history struct system_keyspace: Squeeze update_compaction_history parameters compaction/compaction_manager: update_history accepts compaction_result as rvalue	2025-05-27 14:12:13 +03:00
Anna Stuchlik	b197d1a617	doc: update migration tools overview This commit updates the migration overview page: - It removes the info about migration from SSTable to CQL. - It updates the link to the migrator docs. Fixes https://github.com/scylladb/scylladb/issues/24247 Refs https://github.com/scylladb/scylladb/pull/21775 Closes scylladb/scylladb#24258	2025-05-27 14:07:35 +03:00
Michał Chojnowski	185a032044	utils/stream_compressor: allocate memory for zstd compressors externally The default and recommended way to use zstd compressors is to let zstd allocate and free memory for compressors on its own. That's what we did for zstd compressors used in RPC compression. But it turns out that it generates allocation patterns we dislike. We expected zstd not to generate allocations after the context object is initialized, but it turns out that it tries to downsize the context sometimes (by reallocation). We don't want that because the allocations generated by zstd are large (1 MiB with the parameters we use), so repeating them periodically stresses the reclaimer. We can avoid this by using the "static context" API of zstd, in which the memory for context is allocated manually by the user of the library. In this mode, zstd doesn't allocate anything on its own. The implementation details of this patch add a consideration for forward compatibility: later versions of Scylla can't use a window size greater than the one we hardcoded in this patch when talking to the old version of the decompressor. (This is not a problem, since those compressors are only used for RPC compression at the moment, where cross-version communication can be prevented by bumping COMPRESSOR_NAME. But it's something that the developer who changes the window size must _remember_ to do). Fixes #24160 Fixes #24183 Closes scylladb/scylladb#24161	2025-05-27 12:43:11 +03:00
Jenkins Promoter	76dddb758e	Update pgo profiles - x86_64	2025-05-27 12:02:49 +03:00
Pavel Emelyanov	bd3bd089e1	sstables_loader: Fix load-and-stream vs skip-cleanup check The intention was to fail the REST API call in case --skip-cleanup is requested for --load-and-stream loading. The corresponding if expression checks something else :( even though the log message is correct. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24208	2025-05-27 12:01:01 +03:00
Jenkins Promoter	de9d9c9ece	Update pgo profiles - aarch64	2025-05-27 11:59:56 +03:00
Andrzej Jackowski	555d897a15	test: wait for normal state propagation in test_auth_v2_migration By default, cluster tests have skip_wait_for_gossip_to_settle=0 and ring_delay_ms=0. In tests with gossip topology, it may lead to a race, where nodes see different state of each other. In case of test_auth_v2_migration, there are three nodes. If the first node already knows that the third node is NORMAL, and the second node does not, the system_auth tables can return incomplete results. To avoid such a race, this commit adds a check that all nodes see other nodes as NORMAL before any writes are done. Refs: #24163 Closes scylladb/scylladb#24185	2025-05-27 11:41:09 +03:00
Nikos Dragazis	eaa2ce1bb5	sstables: Fix race when loading checksum component `read_checksum()` loads the checksum component from disk and stores a non-owning reference in the shareable components. To avoid loading the same component twice, the function has an early return statement. However, this does not guarantee atomicity - two fibers or threads may load the component and update the shareable components concurrently. This can lead to use-after-free situations when accessing the component through the shareable components, since the reference stored there is non-owning. This can happen when multiple compaction tasks run on the same SSTable (e.g., regular compaction and scrub-validate). Fix this by not updating the reference in shareable components, if a reference is already in place. Instead, create an owning reference to the existing component for the current fiber. This is less efficient than using a mutex, since the component may be loaded multiple times from disk before noticing the race, but no locks are used for any other SSTable component either. Also, this affects uncompressed SSTables, which are not that common. Fixes #23728. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#23872	2025-05-27 11:26:35 +03:00
Botond Dénes	2739eb49fd	Merge 'docs: remove API reference redirect' from David Garcia Fix for https://github.com/scylladb/scylladb/pull/24097 The stable branch does not contain the split API reference yet. This change fixes the 404 error raised when accessing the API reference on the stable branch due to the redirect. Closes scylladb/scylladb#24259 * github.com:scylladb/scylladb: docs: fix typo docs: remove API reference redirect	2025-05-27 11:24:27 +03:00
Nadav Har'El	8487d81c6e	Merge 'test: mark difference in handling IFs in LWT as scylla_only' from Andrzej Jackowski There is a difference in how ScyllaDB and Cassandra handle conditional batches with different IF statements (such as "IF EXISTS" and "IF NOT EXISTS"). Cassandra tries to detect condition conflicts, and prints an error instead of silently failing the batch, but in ScyllaDB we considered this check to be inconsistent and unhelpful, and decided not to implement it. In this series, we extend the documentation of the ScyllaDB behaviour by extending the documents and improving relevant LWT tests. Fixes: https://github.com/scylladb/scylladb/issues/13011 Backport not needed, only docs and minor test changes. Closes scylladb/scylladb#24086 * github.com:scylladb/scylladb: test: mark difference in handling IFs in LWT as scylla_only docs: cql: add explicit explanation how mixing IFs works in LWT docs: lwt: add two missing spaces	2025-05-27 09:35:41 +03:00
Evgeniy Naydanov	efdb2abdc6	test.py: dtest: make bypass_cache_test.py run using test.py As a part of the porting process, copy missed utility functions from scylla-dtest, remove unused imports and markers, and add single_node marker description to pytest.ini Enable the test in suite.yaml (run in dev mode only)	2025-05-27 05:48:26 +00:00
Evgeniy Naydanov	3a2410324c	test.py: dtest: add missed ScyllaCluster.nodetool() The method executes nodetool command on each running node in a cluster.	2025-05-27 05:48:26 +00:00
Evgeniy Naydanov	6105bb9530	test.py: dtest: copy unmodified bypass_cache_test.py Test is disabled in suite.yaml	2025-05-27 05:48:26 +00:00
Andrzej Jackowski	7dc0c4cf4f	test: close logfile/socket_dir for stopped servers in recycle_cluster PythonTestSuite::recycle_cluster is a function that releases resources of an old, dirty cluster to make it reusable. It closes log_file and maintenance_socket_dir for running nodes in a dirty cluster, however it doesn't do the same for stopped nodes. It leads to leakage of file descriptors of stopped nodes, which in turn can lead to hitting ulimit of open files (that is often 1024) if the leaking test is repeated with `./test.py --repeat ...`. The problem was detected when tests from `test/cluster/dtest/` directory were executed with high `repeat` value. This commit extends `recycle_cluster` to close and cleanup logfile and `socket_dir` for nodes that are stopped (because self.servers in ScyllaCluster is ChainMap of self.running and self.stopped). Closes scylladb/scylladb#24243	2025-05-27 08:37:43 +03:00
David Garcia	d99d1c315c	docs: remove [erno X] prefix from metrics logger Closes scylladb/scylladb#24246	2025-05-27 08:37:11 +03:00
David Garcia	3e331cfbbe	docs: fix typo	2025-05-26 21:34:23 +02:00
David Garcia	eefc9c33e8	docs: remove API reference redirect The stable branch does not contain the split API reference yet. This change fixes the 404 error raised when accessing the API reference on the stable branch.	2025-05-26 21:32:07 +02:00
Andrzej Jackowski	ea6ef5d0aa	test: mark difference in handling IFs in LWT as scylla_only There is a difference in how ScyllaDB and Cassandra handle conditional batches with different IF statements (such as "IF EXISTS" and "IF NOT EXISTS"). Cassandra tries to detect condition conflicts, and prints an error instead of silently failing the batch, but in ScyllaDB we considered this check to be inconsistent and unhelpful, and decided not to implement it. This commit: - Makes test_lwt_with_batch_conflict_1 scylla_only instead of xfail, and changes the scenario to pass with the current implementation. - Adds test_lwt_with_batch_conflict_3 that shows how Cassandra fails a batch statement with different conditions, even when the conditions are not contradictory. - Adds test_lwt_with_batch_conflict_4/5 that show how static rows are handled in conditional batches. Fixes: #13011	2025-05-26 15:47:11 +02:00
Andrzej Jackowski	2d4acb623e	docs: cql: add explicit explanation how mixing IFs works in LWT There is a difference in how ScyllaDB and Cassandra handle conditional batches with different IF statements (such as "IF EXISTS" and "IF NOT EXISTS"). This commit explicitly documents the differences in the behavior. Refs: #13011	2025-05-26 15:13:01 +02:00
Piotr Dulikowski	4508823294	Merge 'test.py: dtest: few fixes missed in the initial implementation' from Evgeniy Naydanov A few problems were found in the dtest shim code after scylladb/scylladb#21580 was merged: - The call of the `init_default_config()` method was missed in scylladb/scylladb#21580. It is required to handle dtest options and markers. - The implementation of the dtest shim uses `server_id` to format the name of a node in a cluster. This is a difference in behavior from dtest. Some dtests use code like `cluster.nodes()["node1"]` to get access to a node object. - The default timeout was missed in the `ScyllaNode.wait_until_stopped()` method. Set it to 600 for debug mode or to 127 otherwise. Closes scylladb/scylladb#24225 * github.com:scylladb/scylladb: test.py: dtest: set default wait_seconds based on build mode test.py: dtest: name nodes in cluster using index starting from 1 test.py: dtest: initialize default config in dtest setup fixture	2025-05-26 13:37:12 +02:00
Yaron Kaikov	89ace09c18	[workflow]: add conflict_reminder to PRs based against `master` Today we send a reminder to the PR author when a backport PR has conflicts. Often, PR authors wait for their PR to be reviewed/merged, but the merge does not happen because the PR now conflicts with master, so maintainers won't merge it. This can lead to a stall, where maintainers wait for the author to rebase and authors wait for the merge. This PR adds the ability to notify the PR author as soon as the base branch moves forward and a rebase is required. Fixes: https://github.com/scylladb/scylla-pkg/issues/4955 Closes scylladb/scylladb#24209	2025-05-26 14:30:06 +03:00
David Garcia	6f722e8bc0	docs: split api reference in smaller files Closes scylladb/scylladb#24097	2025-05-26 12:06:59 +03:00
Radosław Cybulski	90ebea5ebb	Move mutation_fragment::kind into data object Move `mutation_fragment::kind` enum into data object, reducing size of the object from 16 to 8 bytes on current machines.	2025-05-26 11:06:54 +02:00
Radosław Cybulski	ef51bb9bd3	Make mutation_fragment::kind enum 1 byte size Add std::uint8_t as base to the `mutation_fragment::kind` enum, making it one byte in size.	2025-05-26 11:06:54 +02:00
Radosław Cybulski	003e79ac9e	Move mutation_fragment_v2::kind into data object Move `mutation_fragment_v2::kind` enum into data object, reducing size of the object from 16 to 8 bytes on current machines.	2025-05-26 11:06:53 +02:00
Radosław Cybulski	d211119e49	Make mutation_fragment_v2::kind enum 1 byte size Add std::uint8_t as base to `mutation_fragment_v2::kind` enum, which will resize it to 1 byte.	2025-05-26 11:06:53 +02:00
David Garcia	bf9534e2b5	docs: fix \t (tab) is not rendered correctly Closes scylladb/scylladb#24096	2025-05-26 12:06:03 +03:00
Avi Kivity	29932a5af1	pgo: drop Java configuration Since `5e1cf90a51` ("build: replace tools/java submodule with packaged cassandra-stress") we run pre-packaged cassandra-stress. As such, we don't need to look for a Java runtime (which is missing on the frozen toolchain) and can rely on the cassandra-stress package finding its own Java runtime. Fix by just dropping all the Java-finding stuff. Note: Java 11 is in fact present on the frozen toolchain, just not in a way that pgo.py can find it. Fixes #24176. Closes scylladb/scylladb#24178	2025-05-26 10:16:03 +02:00
Avi Kivity	f195c05b0d	untyped_result_set: mark get_blob() as returning unfragmented data Blobs can be large, and unfragmented blobs can easily exceed 128k (as seen in #23903). Rename get_blob() to get_blob_unfragmented() to warn users. Note that most uses are fine as the blobs are really short strings. Closes scylladb/scylladb#24102	2025-05-26 09:40:34 +02:00
Michał Chojnowski	ff8a119f26	test/boost/sstable_compressor_factory_test: define a test suite name It seems that tests in test/boost/combined_tests have to define a test suite name, otherwise they aren't picked up by test.py. Fixes #24199 Closes scylladb/scylladb#24200	2025-05-26 09:35:30 +02:00
Anna Stuchlik	d303edbc39	doc: remove copyright from Cassandra Stress This commit removes the Apache copyright note from the Cassandra Stress page. It's a follow up to https://github.com/scylladb/scylladb/pull/21723, which missed that update (see https://github.com/scylladb/scylladb/pull/21723#discussion_r1944357143). Cassandra Stress is a separate tool with separate repo with the docs, so the copyright information on the page is incorrect. Fixes https://github.com/scylladb/scylladb/issues/23240 Closes scylladb/scylladb#24219	2025-05-26 09:35:30 +02:00
Pavel Emelyanov	2a253ace5e	Merge 'test.py: add coverage for boost with pytest execution' from Andrei Chekun This PR adds the possibility to gather coverage for the boost tests when they're executed with pytest. Since the pytest will be used as the main runner for boost tests as well, we need this before switching the runners. Closes scylladb/scylladb#24236 * github.com:scylladb/scylladb: test.py: add support for coverage for boost test test.py: get the temp dir from facade	2025-05-26 10:18:53 +03:00
Andrei Chekun	537054bfad	test.py: add support for coverage for boost test This PR adds the possibility to gather coverage for the boost tests when they're executed with pytest. Since the pytest will be used as the main runner for boost tests as well, we need this before switching the runners.	2025-05-23 12:54:54 +02:00
Andrei Chekun	c5a7f3415c	test.py: get the temp dir from facade No need to get the temp dir from the options when facade has this information already.	2025-05-23 12:54:48 +02:00
Nadav Har'El	d2844055ad	Merge 'index: implement schema management layer for vector search indexes' from null This pull request adds support for creating custom indexes (at a metadata level) as long as a supported custom class is provided (currently only vector search). The patch contains: - a change in CREATE INDEX statement that allows for the USING keyword to be present as long as one of the supported classes is used - support for describing custom indexes in the DESCRIBE statement - unit tests Co-authored by: @Balwancia Closes scylladb/scylladb#23720 * github.com:scylladb/scylladb: test/cqlpy: add custom index tests index: support storing metadata for custom indices	2025-05-22 12:19:36 +03:00
Pavel Emelyanov	a0d2e63303	Merge 'test.py: add the possibility to gather resource metrics for C++ tests' from Andrei Chekun Move the run_process method to resource gather instance, since we need to start a monitor to check memory consumption in the cgroup. Pytest has concept of the test, but it is completely different from test.py. Resource gather instance take test instance to save and extract information about the test. Additional method emulating test.py test instance added not to rewrite the resource gather instance. Finally, combining all these changes to have ability to get metrics for test in both runners: test.py and pytest. Closes scylladb/scylladb#24091 * github.com:scylladb/scylladb: test.py: add missing parameter for boost tests for pytest runner test.py: add support for boost_data_test_case in combined tests test.py: clean log files after a successful run test.py: attach output of the boost test to the report test.py: fix metrics DB location test.py: move run_process to resource_gather.py test.py: unify using constant for finding repo root directory test.py: refactor run_process in facade.py test.py: add the possibility to create a test alike object	2025-05-22 10:34:34 +03:00
Evgeniy Naydanov	8dc5413f54	test.py: dtest: set default wait_seconds based on build mode The default timeout was missed in the `ScyllaNode.wait_until_stopped()` method. Set it to 600 for debug mode or to 127 otherwise.	2025-05-22 06:39:03 +00:00
Evgeniy Naydanov	eca5d52f1d	test.py: dtest: name nodes in cluster using index starting from 1 The current implementation of the dtest shim uses `server_id` to format the name of a node in a cluster. This is a difference in behavior from dtest. Some dtests use code like `cluster.nodes()["node1"]` to get access to a node object. This commit changes it to be more consistent with dtest.	2025-05-22 06:34:03 +00:00
Evgeniy Naydanov	91e29a302a	test.py: dtest: initialize default config in dtest setup fixture The call of `init_default_config()` method was missed in #21580. It is required to handle dtest options and markers.	2025-05-22 06:22:04 +00:00
Andrei Chekun	8812b14078	test.py: add missing parameter for boost tests for pytest runner Since we are running tests with a pytest, we don't need a report at the end of the run.	2025-05-21 19:41:41 +02:00
Andrei Chekun	66b014621e	test.py: add support for boost_data_test_case in combined tests Change the parsing logic of combined tests to support the case where boost_data_test_case is used, which produces additional lines in the output.	2025-05-21 19:41:41 +02:00
Andrei Chekun	88d24d8ad5	test.py: clean log files after a successful run Clean different output files from the boost and unit tests. Move logs for boost test to the testlog directory instead of having additional directory pytest	2025-05-21 19:41:41 +02:00
Andrei Chekun	a956dd8770	test.py: attach output of the boost test to the report Attach the output of the test to the Allure report in case of failure.	2025-05-21 19:41:39 +02:00
Andrei Chekun	ac86cc9f6d	test.py: fix metrics DB location Fix the issue introduced with scylladb/scylladb#22960. Suite log dir was changed, and the path for metrics DB was relying on it. As a result, DB is now located in the mode directory instead of the root of the testlog.	2025-05-21 15:37:15 +02:00
Andrei Chekun	b5b69710bd	test.py: move run_process to resource_gather.py Move the run_process method to the resource gather instance, since we need to start monitor to check memory consumption in the cgroup. Since resource_gather needs test.py test object, and pytest has no clue about it, adding a simple namespace object to emulate such a test object. It needed only to gather some information regarding the test to be able to add records to the DB. Since we have two facades that can share the same run process procedure, adding a common method to handle this to avoid code duplication.	2025-05-21 15:34:34 +02:00
Andrei Chekun	3bcd6db718	test.py: unify using constant for finding repo root directory Finding the repo root directory dynamically, relative to the temp dir (which in most cases is inside the repo), fails if a non-default temp dir parameter is used. Switch to the constants instead, which also gives a single source of truth for finding the repo root directory.	2025-05-21 15:34:34 +02:00
Andrei Chekun	4e18444831	test.py: refactor run_process in facade.py Add injecting environment variables to the process. Switch from print to a proper logger. Set buffer size to 1 to avoid losing any data from the boost test if the test collapsed. Currently, run process logs and returns stdout and stderr, but boost tests are using stderr only, so stderr is redirected to stdout. This helps with Jenkins as well, since we are reducing the number of files to store.	2025-05-21 15:34:34 +02:00
Andrei Chekun	38310975c5	test.py: add the possibility to create a test alike object resource_gather.py needs a test.py test object to work. It needs some information about the test to be able to write this information to the DB with metrics. When running with pytest, there's no such test object, so add make_test_object to mimic test.py's test object. Switch to getting the mode for constructing the path to the cgroup from the test instead of the suite. They are the same, but this means less has to be emulated in the make_test_object method.	2025-05-21 15:34:34 +02:00
Pavel Emelyanov	dac7589cef	Revert "encryption_test: Catch exact exception" This reverts commit `2d5c0f0cfd`. KMS tests became flaky after it: #24218 Need to revisit.	2025-05-20 13:52:14 +03:00
Petr Gusev	0443081b0d	build: fix merge-compdb.py for CMake 'output' attributes compile_commands.json is used by LSPs (e.g. `clangd` in VS Code) for code navigation. `merge-compdb.py`, called by `configure.py`, merges these files from Scylla, Seastar, and Abseil. The script filters entries by checking the output attribute against a given prefix. This is needed because Scylla’s compile_commands.json is generated by Ninja and includes all build modes, in case the user specified multiple ones in the call to configure.py. Seastar and Abseil databases, generated by CMake, used to omit the output attribute, so filtering did not apply. Starting with `CMake 3.20+`, output attributes are now included and do not match the expected prefix. For example, they could be of the form `absl/synchronization/CMakeFiles/synchronization.dir/internal/futex_waiter.cc.o`. This causes relevant entries from Seastar and Abseil to be filtered out. This patch refactors `merge-compdb.py` to allow specifying an optional prefix per input file, preserving the intent of applying the output filtering logic only for ninja-generated Scylla compdb file. Closes scylladb/scylladb#24211	2025-05-20 08:43:09 +03:00
Piotr Dulikowski	c15cf54e3d	Merge 'test.py: migrate alternator_tests.py from dtest suite' from Evgeniy Naydanov We have a significant number of tests in the scylla-dtest repository and I believe most of them can simply be copied to the test.py framework by adding a relatively small amount of shim code. In this PR I did that for 2 tests: [alternator_tests.py](https://github.com/scylladb/scylla-dtest/blob/next/alternator_tests.py) and [error_example_test.py](https://github.com/scylladb/scylla-dtest/blob/next/error_example_test.py) One of the problems is the async nature of the test.py framework versus the synchronous nature of scylla-dtest. It was resolved by using the universalasync third-party library. The other problem is ccmlib, and it's resolved by adding shim code (`test/dtest/ccmlib`). ccmlib has a lot of dead code and not all of its features are used by scylla-dtest; in this PR I added checks that we will not accidentally use some of them or miss something. When the migration is done, we can easily remove all unused parameters and these checks. `error_example_test.py` is copied as is (just a license preamble added); `alternator_tests.py` has small changes: 1. License preamble 2. Remove unused imports 3. 
Remove unneeded `skip_if` marker (I think it can be backported to dtest, or we can remove the test from dtest after merging this PR) ```diff --- ../../../scylla-dtest/alternator_tests.py +++ alternator_tests.py @@ -1,17 +1,20 @@ +# +# Copyright (C) 2025-present ScyllaDB +# +# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0 +# + import logging import operator import os import random -import shutil import string -import subprocess import tempfile import time from ast import literal_eval from concurrent.futures.thread import ThreadPoolExecutor from copy import deepcopy from decimal import Decimal -from pathlib import Path from pprint import pformat import boto3.dynamodb.types @@ -46,7 +49,6 @@ ) from dtest_class import get_ip_from_node, wait_for from tools.cluster import new_node -from tools.marks import issue_open, with_feature from tools.misc import set_trace_probability from tools.retrying import retrying @@ -168,7 +170,6 @@ read_and_delete_set_elements_thread.join() @pytest.mark.next_gating - @pytest.mark.skip_if(with_feature("tablets") & issue_open("#18002")) def test_decommission_during_dynamo_load(self): self.prepare_dynamodb_cluster(num_of_nodes=3) node1, node2, node3 = self.cluster.nodelist() ``` Because all tests in this repo are considered to be "gating", I removed all not next_gating tests and all dtest's suites markers as a separate commit. To reduce tests execution time run the tests in dev mode only and made some sleeps smaller. In result, 23 tests added in total (22 in `test_alternator.py` and 1 in `test_error_example`.) The added tests will increase CI time by ~2х4 =8 minutes. 
Closes scylladb/scylladb#21580 * github.com:scylladb/scylladb: test.py: dtest/alternator_tests.py: make sleep intervals smaller test.py: dtest/alternator_tests.py: remove not next_gating tests test.py: migrate alternator_tests.py from dtest test.py: initial implementation of dtest/ccm shim test.py: manager: add server_get_returncode() method test.py: manager: change CLI and env options on a node start test.py: REST API: add set_trace_probability() method test.py: REST API: add get_tokens() method test.py: rework log_browsing for dtest migration	2025-05-20 00:13:16 +02:00
Evgeniy Naydanov	e456f0ed7b	test.py: dtest/alternator_tests.py: make sleep intervals smaller	2025-05-19 12:27:32 +00:00
Evgeniy Naydanov	8dd86818a0	test.py: dtest/alternator_tests.py: remove not next_gating tests Remove all not next_gating tests and remove any dtest suites markers because all tests in this repo are considered to be "gating".	2025-05-19 12:27:32 +00:00
Evgeniy Naydanov	57c1035146	test.py: migrate alternator_tests.py from dtest The test almost unmodified except remove unneeded skipif mark and unused imports.	2025-05-19 12:27:32 +00:00
Evgeniy Naydanov	ac1551892b	test.py: initial implementation of dtest/ccm shim Use universalasync library to make test.py async code compatible with synchronous code of dtest/ccm Also, copied unmodified error_example_test.py from dtest as an example. Run the test in `dev` mode only.	2025-05-19 12:27:31 +00:00
Evgeniy Naydanov	2cb640f95c	test.py: manager: add server_get_returncode() method The method returns None if the Scylla process is still running, or its returncode otherwise. If no Scylla process was launched, it raises a NoSuchProcess exception.	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	d874beb17f	test.py: manager: change CLI and env options on a node start Add parameters to the server_start() method to make it possible to change Scylla's CLI and env options on a node start. Also, add an `expected_server_up_state` parameter, as we have for the server_add() method.	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	5d3b54aa9b	test.py: REST API: add set_trace_probability() method	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	a16a4b6171	test.py: REST API: add get_tokens() method Get a list of the tokens for the specified node. Optional `endpoint` parameter can be provided.	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	f6e3fdd778	test.py: rework log_browsing for dtest migration Rework `ScyllaLogFile.wait_for()` method to make it easier to add required methods to ScyllaNode class of ccm-like shim. Also, added `ScyllaLogFile.grep_for_errors()` method and reworked `ScyllaLogFile.grep()`	2025-05-19 11:50:55 +00:00
Łukasz Paszkowski	0a2f0c6852	nodetool: Refactor of compactionhistory_operation Simplify code by using std::apply that unpacks std::array into separate items to pass further to a callable. This simplifies the code that looks: fmt::print(std::cout, fmt::runtime(header_row_format.c_str()), header_row[0], header_row[1], header_row[2], header_row[3], header_row[4], header_row[5], header_row[6], header_row[7], header_row[8], header_row[9], header_row[10], header_row[11], header_row[12], header_row[13]); into something like: std::apply(fh, header_row);	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	edb666f461	nodetool: Add more stats into compactionhistory output Incorporate additional statistics stored in the compaction_history system table. Depending on the requested format type, the output has different form. Remove unnecessary duplicated history_entry struct and instead use extracted db::compaction_history_entry structure. Running the cql command: select * from system.compaction_history; prints sstable's generation type as UUID (e.g. 5a5cf800-b617-11ef-a97d-8438c36f0e31), see generation_type::data_value() which is different than its fmt format (e.g. 3glx_0srx_1pasg2ksepk902v8dt). Therefore, to unify the outputs, generation_type is converted to data_value before it is printed.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	583cc675ce	api/compaction_manager: Extend compaction_history api Extend api of /compaction_manager/compaction_history to include newly added columns to the compaction history table from the previous patches.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	2793369288	compaction: Collect tombstone purge stats during compaction Collect tombstone purge statistics like + total number of purge attempts + number of purge failures due to data overlapping with memtables + number of purge failures due to data overlapping with non-compacting sstables and expose them in the compaction_stats structure.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	6b729fabc9	compacting_reader: Extend to accept tombstone purge statistics Extend the make_compacting_reader function and the constructor of compacting_reader to accept an optional pointer to the tombstone purge statistics structure, which is then passed further down to compact_mutation_state.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	546b2c191f	mutation_compactor: Collect tombstone purge attempts Let compact_mutation_state collect all tombstone purge attempts and failures. For this purpose a new statistics structure is created (tombstone_purge_stats) and the relevant stats are collected in the can_purge_tombstone method. The statistics are collected only for sstable compaction. An optional statistics structure can be passed in via the compact_mutation_state constructor.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	503d4f014c	compaction_garbage_collector: Extend return type of max_purgeable_fn Currently, when a max purgeable timestamp is computed, there is no information about where it comes from and how the value was obtained. Take compaction: if there are memtables or other uncompacting sstables possibly shadowing data, the timestamp is decreased to ensure a tombstone is not purged, but the caller does not know why the timestamp has that value. In this patch, we extend the return type of max_purgeable_fn to contain not only a timestamp but also information on how it was computed. This information will be required to collect statistics on tombstone purge failures due to overlapping memtables/uncompacting sstables that come later in the series.	2025-05-16 19:59:54 +02:00
Anna Stuchlik	2d7db0867c	doc: fix the product name for version 2025.1 Starting with 2025.1, ScyllaDB versions are no longer called "Enterprise", but the OS support page still uses that label. This commit fixes that by replacing "Enterprise" with "ScyllaDB". This update is required since we've removed "Enterprise" from everywhere else, including the commands, so having it here is confusing. Fixes https://github.com/scylladb/scylladb/issues/24179 Closes scylladb/scylladb#24181	2025-05-16 12:16:00 +02:00
Avi Kivity	37f9cf6de6	dist: rpm: override %_sbindir for Fedora 42 Fedora 42 merged /usr/sbin into /usr/bin [1]. As part of that change the rpm macro %_sbindir was redefined from /usr/sbin to /usr/bin. As a result RPM build on Fedora 42 fails: install.sh places some files into /usr/sbin, while rpmbuild looks for them in /usr/bin. We could resolve this either by following the change and moving the files to /usr/bin as well, or fixing the spec to place the files in /usr/sbin. The former is more difficult: - what about Debian/Ubuntu? - what about older RPM-based distributions (like all RHEL distributions)? - what about scripts that hard-code /usr/sbin/<scylla utility>? So we pick the latter, and redefine %_sbindir to /usr/sbin. Since that directory still exists (as a symlink), installation on systems with merged /usr/bin and /usr/sbin will work. We'll have to address the problem later (likely by installing to either /usr/bin or /usr/sbin depending on context), but for now, this is a simple solution that works everywhere. [1] https://fedoraproject.org/wiki/Changes/Unify_bin_and_sbin Closes scylladb/scylladb#24101	2025-05-16 12:05:29 +02:00
Aleksandra Martyniuk	9c03255fd2	cql_test_env: main: move stream_manager initialization Currently, stream_manager is initialized after storage_service and so it is stopped before storage_service is. In its stop method, storage_service accesses stream_manager, which is uninitialized at that time. Move stream_manager initialization before the storage_service initialization. Fixes: #23207. Closes scylladb/scylladb#24008	2025-05-15 17:17:35 +03:00
Avi Kivity	4f87362abb	compaction_manager: drop gratuitous conversion from interval to wrapped_interval The conversion is unnecessary and likely dates back from before the split between interval and wrapped_interval. It gets in the way of making the conversion explicit. Closes scylladb/scylladb#24164	2025-05-15 16:15:55 +03:00
Nadav Har'El	27ad772a66	test/cqlpy: fix "run --release 2025.1" This patch fixes "test/cqlpy/run --release 2025.1" which fails as follows on all tests with indexes or views: Secondary indexes are not supported on base tables with tablets test/cqlpy/run can run cqlpy (and alternator) tests on various official releases of Scylla which it knows how to download. When running old versions of Scylla, we need to change the configuration options to those that were needed on specific versions. On new versions of Scylla we need to pass --experimental-features=views-with-tablets to be able to test materialized views, but in older versions we need to remove that parameter because it didn't exist. We incorrectly removed it for any versions 2025.1 or earlier, but that's incorrect - it just needs to be removed for versions strictly earlier than 2025.1 - it is needed for 2025.1 (I tested it is indeed needed even in the earlier RCs). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24144	2025-05-15 16:13:01 +03:00
Pavel Emelyanov	2f5b452c7c	api: Drop class req_params It's now unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:08:52 +03:00
Pavel Emelyanov	9628c3a4a5	api: Stop using req_params in parse_scrub_options The "keyspace" and "cf" pair of options are now parsed similarly to how the recently changed ss::force_keyspace_compaction handler does. The "scrub_mode" query param is saved directly into an sstring variable and its presence is checked by an .empty() call. If the parameter is missing, request::get_query_param() would return an empty string, so the change is correct. The "skip_corrupted" is a boolean option; other options are already parsed by hand, without the help of req_params facilities. There's a test that validates the work of req_params::process() of the scrub endpoint -- it passes "invalid" options. This test is temporarily removed according to the PR description. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:57 +03:00
Pavel Emelyanov	fd0128849e	api: Stop using req_params in tasks::force_keyspace_compaction_async This handler in fact duplicates cf::force_major_compaction in how it parses its options, so the change is the same. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:53 +03:00
Pavel Emelyanov	09c9a5baa7	api: Stop using req_params in ss::force_keyspace_compaction The "keyspace" mandatory param and "cf" query one are used, respectively, to get and validate keyspace and to parse table infos. Both actions can be used with the corresponding parse_table_infos() overload. Other parameters are boolean query ones and can be parsed directly. By and large this change repeats the change in cf::force_major_compaction done previously. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
Pavel Emelyanov	f7e8d6ba09	api: Stop using req_params in ss::force_compaction This handler only has two query parameters that can be parsed using the validate_bool_x helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
Pavel Emelyanov	a320550bd1	api: Stop using req_params in cf::force_major_compaction The mandatory "name" parameter can be picked directly from request path params, as described in the PR description. The "split_output" is placeholder and is just checked for being there at all, without any parsing. Other parameters are query ones too, and are parsed with the help of recently introduced validate_bool_x helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
Pavel Emelyanov	253c82f03a	api: Add validate_bool_x() helper There's validate_bool() one that converts "true" to true and "false" to false. This helper mimics the req_params' parser of bool and renders true from "true", "yes" or "1" and false from "false", "no" or "0" (all case insensitively). Unlike its prototype, which renders disengaged optional bool in case the parameter is empty, this helper returns the passed default value. Will replace the req_params eventually. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
Botond Dénes	697945820b	Merge 'utils: chunked_vector: add some modifiers' from Avi Kivity chunked_vector is a replacement for std::vector that avoids large contiguous allocations. In this series, we add some missing modifiers and improve quality-of-life for chunked_vector users (the static_assert patch). Those modifiers were generally unused since they have O(n) complexity and therefore not useful for hot paths, but they are used in some control plane code on vectors which we'd like to replace with chunked_vectors. A candidate for such a replacement is token_range_vector (see #3335). This is a prerequisite for fixing some minor stalls; I don't expect we'll backport fixes to those stalls. Closes scylladb/scylladb#24162 * github.com:scylladb/scylladb: utils: chunked_vector: add swap() method utils: chunked_vector: add range insert() overloads utils: chunked_vector: relax static_assert utils: chunked_vector: implement erase() for single elements and ranges utils: chunked_vector: implement insert() for single-element inserts	2025-05-15 09:42:14 +03:00
Yaron Kaikov	f124b073b1	toolchain: set `scylla-driver` release based on tools/cqlsh In `install-dependencies.sh` we use a hardcoded `scylla-driver` release. This version should be identical to the `tools/cqlsh/requirements.txt` value. It's better to have one source for the `scylla-driver` version. Update `install-dependencies.sh` to use the release from `tools/cqlsh` directly. Remove the hardcoded `geomet` version. Also remove the support for the `s390x` arch, as we never use it. Frozen toolchain regenerated. Optimized clang from * https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz * https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#23841	2025-05-15 06:08:14 +03:00
Pavel Emelyanov	2e83b0367f	api: Use structured bindings in get_built_indexes() code Shorter this way Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24155	2025-05-14 19:03:13 +03:00
Wojciech Mitros	5920647617	mv: remove queue length limit from the view update read concurrency semaphore Each view update is correlated to a write that generates it (aside from view building which is throttled separately). These writes are limited by a throttling mechanism, which effectively works by performing the writes with CL=ALL if ongoing writes exceed some memory usage limit. When writes generate view updates, they usually also need to perform a read. This read goes through a read concurrency semaphore where it can get delayed or killed. The semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue. If the number of queued reads exceeds a specific limit, the view update will fail on the replica, causing inconsistencies. This limit is not necessary. When a read gets queued on the semaphore, the write that's causing the view update is paused, so the write takes part in the regular write throttling. If too many writes get stuck on view update reads, they will get throttled, so their number is limited and the number of queued reads is also limited to the same amount. In this patch we remove the specified queue length limit for the view update read concurrency semaphore. Instead of this limit, the queue will now be limited indirectly, by the base write throttling mechanism. This may allow the queue to grow longer than with the previous limit, but it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the remaining ones that get queued use a tiny amount of memory, less than the writes that generated them and which are getting limited directly. Fixes https://github.com/scylladb/scylladb/issues/23319 Closes scylladb/scylladb#24112	2025-05-14 18:29:30 +03:00
Botond Dénes	700a5f86ed	tools/scylla-nodetool: status: handle negative load sizes Negative load sizes don't make sense, but we've seen a case in production, where a negative number was returned by ScyllaDB REST API, so be prepared to handle these too. Fixes: scylladb/scylladb#24134 Closes scylladb/scylladb#24135	2025-05-14 18:28:29 +03:00
Avi Kivity	70be73d036	Merge 'Refactor out code from `test_restore_with_streaming_scopes`' from Robert Bindar Lots of code from this test can be reused in PR #23861. I'm splitting it now in this change so we can merge it cleanly as a separate patch. Refs #23564 Closes scylladb/scylladb#24105 * github.com:scylladb/scylladb: Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes	2025-05-14 18:10:53 +03:00
Botond Dénes	9f8de9adc8	Merge 'Add ability to skip SSTables cleanup when loading them' from Pavel Emelyanov The non-streaming loading of sstables performs cleanup since recently [1]. For vnodes, unfortunately, cleanup is almost unavoidable, because of the nature of vnodes sharding, even if sstable is already clean. This leads to waste of IO and CPU for nothing. Skipping the cleanup in a smart way is possible, but requires too many changes in the code and in the on-disk data. However, the effort will not help existing SSTables and it's going to be obsoleted by tablets some time soon. Said that, the easiest way to skip cleanup is the explicit --skip-cleanup option for nodetool and respective skip_cleanup parameter for API handler. New feature, no backport fixes #24136 refs #12422 [1] Closes scylladb/scylladb#24139 * github.com:scylladb/scylladb: nodetool: Add refresh --skip-cleanup option api: Introduce skip_cleanup query parameter distributed_loader: Don't create owned ranges if skip-cleanup is true code: Push bool skip_cleanup flag around	2025-05-14 16:47:34 +03:00
Avi Kivity	13a75ff835	utils: chunked_vector: add swap() method Following std::vector(), we implement swap(). It's a simple matter of swapping all the contents. A unit test is added.	2025-05-14 16:19:40 +03:00
Avi Kivity	24e0d17def	utils: chunked_vector: add range insert() overloads Inserts an iterator range at some position. Again we insert the range at the end and use std::rotate() to move the newly inserted elements into place, forgoing possible optimizations. Unit tests are added.	2025-05-14 16:19:40 +03:00
Avi Kivity	9425a3c242	utils: chunked_vector: relax static_assert chunked_vector is only implemented for types with a non-throwing move constructor; this greatly simplifies the implementation. We have a static_assert to enforce it (should really be a constraint, but chunked_vector predates C++ concepts). This static_assert prevents forward declarations from compiling: class forward_declared; using a = utils::chunked_vector<forward_declared>; `a` won't compile since the static_assert will be instantiated and will fail since forward_declared is an incomplete type. Using a constraint has the same problem. Fix by moving the static_assert to the destructor. The destructor won't be instantiated by the forward declaration, so it won't trigger. It will trigger when someone destroys the vector; at this point the types are no longer forward declared.	2025-05-14 16:19:40 +03:00
Avi Kivity	d6eefce145	utils: chunked_vector: implement erase() for single elements and ranges Implement using std::rotate() and resize(). The elements to be erased are rotated to the end, then resized out of existence. Again we defer optimization for trivially copyable types. Unit tests are added. Needed for range_streamer with token_ranges using chunked_vector.	2025-05-14 16:19:37 +03:00
Botond Dénes	b491ae1039	Merge 'raft_sys_table_storage: avoid temp buffer when deserializing log_entry' from Petr Gusev The get_blob method linearizes data by copying it into a single buffer, which can cause 'oversized allocation' warnings. In this commit we avoid copying by creating an input stream on top of the original fragmented managed bytes, returned by untyped_result_set_row::get_view. fixes scylladb/scylladb#23903 backport: no need, not a critical issue. Closes scylladb/scylladb#24123 * github.com:scylladb/scylladb: raft_sys_table_storage: avoid temporary buffer when deserializing log_entry serializer_impl.hh: add as_input_stream(managed_bytes_view) overload	2025-05-14 15:10:47 +03:00
Avi Kivity	5301f3d0b5	utils: chunked_vector: implement insert() for single-element inserts partition_range_compat's unwrap() needs insert if we are to use it for chunked_vector (which we do). Implement using push_back() and std::rotate(). emplace(iterator, args) is also implemented, though the benefit is diluted (it will be moved after construction). The implementation isn't optimal - if T is trivially copyable then using std::memmove() will be much faster than std::rotate(), but this complex optimization is left for later. Unit tests are added.	2025-05-14 14:54:59 +03:00
Robert Bindar	548a1ec20a	Refactor out code from test_restore_with_streaming_scopes part 5: check_data_is_back Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:39:01 +03:00
Robert Bindar	29309ae533	Refactor out code from test_restore_with_streaming_scopes part 4: compute_scope Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:39:01 +03:00
Robert Bindar	a0f0580a9c	Refactor out code from test_restore_with_streaming_scopes part 3: create_dataset Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:38:59 +03:00
Robert Bindar	5171ca385a	Refactor out code from test_restore_with_streaming_scopes part 2: take_snapshot Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:31:19 +03:00
Robert Bindar	f09bb20ac4	Refactor out code from test_restore_with_streaming_scopes part 1: create_cluster Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:30:40 +03:00
Andrzej Jackowski	8b660f0af7	test: add tests for prepared statement metadata consistency corner cases Implement corner-cases of prepared statement metadata, as described in scylladb#20860. Although the purpose of the test was to verify the newly implemented SCYLLA_USE_METADATA_ID protocol extension, the test also passes with scylla-driver 3.29.3 that doesn't implement the support for this extension. That is because the driver doesn't implement support for skip_metadata flag, so fresh metadata are included in every prepared statement response, regardless of the metadata_id. This change: - Add test_changed_prepared_statement_metadata_columns to verify a scenario when a number of columns changes in a table used by a prepared statement - Add test_changed_prepared_statement_metadata_types to verify a scenario when a type of a column changes in a table used by a prepared statement - Add test_changed_prepared_statement_metadata_udt to verify a scenario when a UDT changes in a table used by a prepared statement I tested the code with a modified Python driver (ref. scylladb/python-driver#457): - If SKIP_METADATA is enabled (scylladb/python-driver@c1809c1) but no other changes are introduced, all three test cases fail. - If SKIP_METADATA is disabled (no scylladb/python-driver@c1809c1) all test cases pass because fresh metadata are included in each reply. - If SKIP_METADATA is enabled (scylladb/python-driver@c1809c1) and SCYLLA_USE_METADATA_ID extension is included (scylladb/python-driver@8aba164) all test cases pass, which verifies the correctness of the implementation.	2025-05-14 09:59:19 +02:00
Andrzej Jackowski	086df24555	transport: implement SCYLLA_USE_METADATA_ID support Metadata id was introduced in CQLv5 to make metadata of prepared statement consistent between driver and database. This commit introduces a protocol extension that allows to use the same mechanism in CQLv4. This change: - Introduce SCYLLA_USE_METADATA_ID protocol extension for CQLv4 - Introduce METADATA_CHANGED flag in RESULT. The flag comes directly from CQLv5 binary protocol. In CQLv4, the bit was never used, so we assume it is safe to reuse it. - Implement handling of metadata_id and METADATA_CHANGED in RESULT rows - Implement returning metadata_id in RESULT prepared - Implement reading metadata_id from EXECUTE - Added description of SCYLLA_USE_METADATA_ID in documentation Metadata_id is wrapped in cql_metadata_id_wrapper because we need to distinguish the following situations: - Metadata_id is not supported by the protocol (e.g. CQLv4 without the extension is used) - Metadata_id is supported by the protocol but not set - e.g. PREPARE query is being handled: it doesn't contain metadata_id in the request but the reply (RESULT prepared) must contain metadata_id - Metadata_id is supported by the protocol and set, any number of bytes >= 0 is allowed, according to the CQLv5 protocol specification Fixes scylladb/scylladb#20860	2025-05-14 09:59:16 +02:00
Andrzej Jackowski	c32aba93b4	cql3: implement metadata::calculate_metadata_id() CQLv5 introduced metadata_id, which is a checksum computed from column names and types, to track schema changes in prepared statements. This commit introduces calculate_metadata_id to compute such id for given metadata. Please note that calculate_metadata_id() produces different hashes than Cassandra's computeResultMetadataId(). We use SHA256 truncated to 128 bits instead of MD5. There are also two smaller technical differences: calculate_metadata_id() doesn't add unneeded zeros and it adds a length of a string when an sstring is being fed to the hasher. The difference is intentional because MD5 has known vulnerabilities, moreover we don't want to introduce any dependency between our metadata_id and Cassandra's. This change: - Add cql_metadata_id_type - Implement metadata::calculate_metadata_id() - Add boost tests to confirm correctness of the function	2025-05-14 09:33:16 +02:00
Michał Hudobski	8ea862f1e8	test/cqlpy: add custom index tests Unit tests checking the behavior of the added support for create custom index statement	2025-05-14 09:32:01 +02:00
Michał Hudobski	05daa8dded	index: support storing metadata for custom indices Added function returning custom index class name. Added printing custom index class name when using DESCRIBE. Changed validation to reflect current support of indices.	2025-05-14 09:32:00 +02:00
Łukasz Paszkowski	0327964d57	compaction: Extend compaction_result to collect more information The compaction_result struct has been extended with the following properties: + id of the shard the compaction took place on + type of the compaction + time when the compaction started + list of sstable files to be compacted + list of sstable files generated by compaction	2025-05-14 08:32:07 +02:00
Łukasz Paszkowski	0490068982	system_keyspace: Upgrade compaction_history table Currently, the system.compaction_history table misses precious information like the type of compaction (cleanup, major, resharding, etc) or the sstable generations involved (in and out) used countless times to diagnose issues. Thus, the commit extends the current definition of the table by adding the following columns: + "compaction_type" (text) + "started_at" (int) + "shard_id" (int) + "sstables_in" (list<sstableinfo_type>) + "sstables_out" (list<sstableinfo_type>) + "total_tombstone_purge_attempt" (long) + "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long) + "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long) Furthermore, the commit introduces a new feature flag in order to prevent nodes from writing data to new columns when a cluster is not fully upgraded.	2025-05-14 08:32:05 +02:00
Łukasz Paszkowski	28d0c98dab	system_keyspace: Create UDT: sstableinfo_type The new user defined type holds the following information on sstable: + generation uuid; + origin text; + size long; and will be used by the system.compaction_history table to keep track of compacted files and the files being the result of this compaction.	2025-05-14 08:31:40 +02:00
Łukasz Paszkowski	dc6f8881b8	system_keyspace: Extract compaction_history struct Move the compaction_history_entry struct to a separate file. The intent of this change is to later re-use it in scylla-nodetool as it currently defines its own structure that is very similar.	2025-05-14 08:31:40 +02:00
Łukasz Paszkowski	4c93b5292d	system_keyspace: Squeeze update_compaction_history parameters Since the number of statistics inserted into compaction_history table grows in time, the number of parameters in the method update_compaction_history grows as well. So instead, let's re-use the already existing compaction_history_entry structure to populate data from the compaction_manager to the system table.	2025-05-14 08:31:40 +02:00
Łukasz Paszkowski	342e9a3f5c	compaction/compaction_manager: update_history accepts compaction_result as rvalue The compaction_result struct holding compaction's results and statistics is obtained immediately before the update_history is called. Move it instead of passing a const reference.	2025-05-14 08:31:40 +02:00
Andrzej Jackowski	f8f710c95e	test: simplify pytest params in test_long_query_timeout_erm One of pytest parameters in test_long_query_timeout_erm.py was a CQL query containing spaces and special chars such as '', '(', ')', '{', '}'. After upgrading to Fedora 42, the test started to fail with the error "test.pylib.rest_client.HTTPError: HTTP error 404" with uri=`http://...[SELECT FROM {}-True-False].dev.1`. To prevent such errors, this commit changes the parameter to a string without spaces and such special characters. Fixes: scylladb/scylladb#24124 Closes scylladb/scylladb#24130	2025-05-13 21:44:15 +03:00
Benny Halevy	2ceecc9d2a	generic_server: server: do_accepts: prevent gate_closed_exception do_accepts might be called after `_gate` was closed. In this case it should just return early rather than throw gate_closed_exception, similar to how it breaks from the infinite for loop when the _gate is closed. With this change, do_accepts (and consequently, _listeners_stopped), should never fail as it catches and ignores all exceptions in the loop. Fixes #23775 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23818	2025-05-13 20:00:04 +03:00
Pavel Emelyanov	c0796244bb	nodetool: Add refresh --skip-cleanup option The option "conflicts" with load-and-stream. Tests and doc included. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 19:07:38 +03:00
Pavel Emelyanov	1b1f653699	api: Introduce skip_cleanup query parameter Just copy the load_and_stream and primary_replica_only logic, this new option is the same in this sense. Throw if it's specified with the load_and_stream one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 17:06:28 +03:00
Pavel Emelyanov	ed3ce0f6af	distributed_loader: Don't create owned ranges if skip-cleanup is true In order to make reshard compaction task run cleanup, the owner-ranges pointer is passed to it. If it's nullptr, the cleanup is not performed. So to do the skip-cleanup, the easiest (but not the most apparent) way is not to initialize the pointer and keep it nullptr. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 16:52:15 +03:00
Pavel Emelyanov	4ab049ac8d	code: Push bool skip_cleanup flag around Just put the boolean into the callstack between API and distributed loader to reduce the churn in the next patches. No functional changes, flag is false and unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 16:51:21 +03:00
Dawid Mędrek	9ebd6df43a	locator/production_snitch_base: Reduce log level when property file incomplete We're reducing the log level in case the provided property file is incomplete. The rationale behind this change is related to how CCM interacts with Scylla: * The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties` configuration every 60 seconds. * When a new node is added to the cluster, CCM recreates the `cassandra-rackdc.properties` file for EVERY node. If those two processes start happening at about the same time, it may lead to Scylla trying to read a not-completely-recreated file, and an error will be produced. Although we would normally fix this issue and try to avoid the race, that behavior will be no longer relevant as we're making the rack and DC values immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem in the older versions of Scylla could bring a more serious regression. Having that in mind, this commit is a compromise between making CI less flaky and having minimal impact when backported. We do the same for when the format of the file is invalid: the rationale is the same. We also do that for when there is a double declaration. Although it seems impossible that this can stem from the same scenario the other two errors can (since if the format of the file is valid, the error is justified; if the format is invalid, it should be detected sooner than a doubled declaration), let's stay consistent with the logging level. Fixes scylladb/scylladb#20092 Closes scylladb/scylladb#23956	2025-05-13 13:59:39 +03:00
Andrei Chekun	c33c0d62e1	test.py: change pattern for cleaning .log files in testlog directory Currently, test.py will delete recursively all .log files under the testlog directory instead of cleaning only the testlog directory itself. With this change it will not go deeper to delete log files. We still have a method for cleaning the log files in modes directories. The downside of this solution is that we will need to explicitly tell all directories that we want to clean. Fixes: https://github.com/scylladb/scylladb/issues/24001 Closes scylladb/scylladb#24004	2025-05-13 13:58:36 +03:00
Anna Stuchlik	eed8373b77	doc: remove the redundant pages This commit removes two redundant pages and adds the related redirections. - The Tutorials page is a duplicate and is not maintained anymore. Having it in the docs hurts the SEO of the up-to-date Tutorials page. - The Contributing page is not helpful. Contributions-related information should be maintained in the project README file. Fixes https://github.com/scylladb/scylladb/issues/17279 Fixes https://github.com/scylladb/scylladb/issues/24060 Closes scylladb/scylladb#24090	2025-05-13 13:29:04 +03:00
Andrei Chekun	747f2b1301	docs: add more steps in installation of test.py Documentation for --gather-metric parameter was missing. This functionality can break regular flow of using test.py, because of possible misconfiguration of the cgroup on the local machine. Added explanation how to deal with potential issue of gathering metrics functionality and how to switch it off. Fixes: https://github.com/scylladb/scylladb/issues/20763 Closes scylladb/scylladb#24095	2025-05-13 13:08:18 +03:00
Ernest Zaslavsky	2d5c0f0cfd	encryption_test: Catch exact exception Apparently `test_kms_network_error` will succeed under any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it will throw, the test will be marked as passed. Start catching the exact exception that we expect to be thrown. Closes scylladb/scylladb#24065	2025-05-13 12:55:19 +03:00
Ernest Zaslavsky	4a7c847cba	database_test: Wait for the index to be created Just call `wait_until_built` for the index in question fix: https://github.com/scylladb/scylladb/issues/24059 Closes scylladb/scylladb#24117	2025-05-13 11:40:55 +03:00
Petr Gusev	f245b05022	raft_sys_table_storage: avoid temporary buffer when deserializing log_entry The get_blob() method linearizes data by copying it into a single buffer, which can trigger "oversized allocation" warnings. This commit avoids that extra copy by creating an input stream directly over the original fragmented managed bytes returned by untyped_result_set_row::get_view(). Fixes scylladb/scylladb#23903	2025-05-13 10:33:57 +02:00
Petr Gusev	6496ae6573	serializer_impl.hh: add as_input_stream(managed_bytes_view) overload It's useful to have it here so that people can find it easily.	2025-05-13 10:32:32 +02:00
Wojciech Mitros	bceb64fb5a	test_mv_tablets_replace: wait for tablet replicas to balance before working on them In the test test_tablet_mv_replica_pairing_during_replace we stop 2 out of 4 servers while using RF=2. Even though in the test we use exactly 4 tablets (1 for each replica of a base table and view), initially, the tablets may not be split evenly between all nodes. Because of this, even when we chose a server that hosts the view and a different server that hosts the base table, we sometimes stopped all replicas of the base or the view table because the node with the base table replica may also be a view replica. After some time, the tablets should be distributed across all nodes. When that happens, there will be no common nodes with a base and view replica, so the test scenario will continue as planned. In this patch, we add this waiting period after creating the base and view, and continue the test only when all 4 tablets are on distinct nodes. Fixes https://github.com/scylladb/scylladb/issues/23982 Fixes https://github.com/scylladb/scylladb/issues/23997 Closes scylladb/scylladb#24111	2025-05-12 16:17:48 +02:00
Nadav Har'El	248688473d	build: when compiling without -g, don't leave debugging information If Scylla is compiled without "-g" (this is, for example, the default in dev build mode), any static library that we link with it and contains any debugging information will cause the resulting executable to incorrectly look (e.g., to file(1) or to gdb) like it has debugging information. For more than three years now (see #10863 for historical context), the wasmtime.a library, which has debugging symbols, has caused this to happen. In this patch, if a certain build is compiled WITHOUT "-g", we add the "--strip-debug" option to the linker to remove the partial debugging information from the executable. Note that --strip-debug is not added in build modes which do use "-g", or if the user explicitly asked to add -g (e.g., "configure.py --cflags=-g"). Before this patch: $ file build/dev/scylla build/dev/scylla: ELF 64-bit LSB executable ... , with debug_info, not stripped After this patch: $ file build/dev/scylla build/dev/scylla: ELF 64-bit LSB executable ... , not stripped Fixes #23832. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23840	2025-05-12 15:42:17 +03:00
Ujjawal Kumar	35cd200789	ent/encryption/kms_host.cc: Change regex pattern to include hyphens in AWS profile names. Fixes #22430 Closes scylladb/scylladb#23805	2025-05-12 15:41:00 +03:00
Botond Dénes	746382257c	Merge 'compress: fix an internal error when a specific debug log is enabled' from Michał Chojnowski compress: fix an internal error when a specific debug log is enabled While iterating over the recent `69684e16d8` series, I shot myself in the foot by defining `algorithm_to_name(algorithm::none)` to be an internal error, and later calling that anyway in a debug log. (Tests didn't catch it because there's no test which simultaneously enables the debug log and configures some table to have no compression). This proves that `algorithm_to_name` is too much of a footgun. Fix it so that calling `algorithm_to_name(algorithm::none)` is legal. In hindsight, I should have done that immediately. Fixes #23624 Fix for recently-added code, no backporting needed. Closes scylladb/scylladb#23625 * github.com:scylladb/scylladb: test_sstable_compression_dictionaries: reproduce an internal error in debug logging compress: fix an internal error when a specific debug log is enabled	2025-05-12 15:40:12 +03:00
Calle Wilund	b28413890b	encryption_at_rest_test: Add test cases for bad KMIP config on reboot Refs scylladb/scylla-enterprise#5321 Adds two small test cases, for slight variations on KMIP host config being missing when rebooting a node, and table/sstable resolution failing due to this. Mainly to verify that we fail as expected, without crashing. Closes scylladb/scylladb#23544	2025-05-12 15:39:05 +03:00
Nadav Har'El	7c24e09b0d	test/alternator: add some Alternator-over-HTTPS tests This patch adds a few tests for Alternator over HTTPS (encrypted HTTP, a.k.a. TLS or SSL). The tests are skipped unless run with "--https", so they will not be run in CI. Nevertheless, they are useful to improve our understanding on how DynamoDB works over HTTPS and can be a basis for adding more tests for HTTPS support. The included tests pass on both Alternator and AWS DynamoDB. One test checks that both TLS 1.2 and TLS 1.3 are properly supported, and if chosen by the client, are actually honored. The same test also checks that TLS 1.1 is not supported, and results in a proper error if attempted. Both AWS DynamoDB and Alternator support the same protocols. Another test verifies that HTTP (unencrypted) requests cannot be sent over an HTTPS port. This is important for security - an installation that chooses to allow only HTTPS wants users to only use encrypted connections, and would not want users to continue sending unencrypted requests to the HTTPS port. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23493	2025-05-12 15:38:33 +03:00
Kefu Chai	8320d703cd	scripts/open-coredump.sh: Add substitute-path hint in prompt message Add a substitute-path rule hint in the greeting message displayed before launching dbuild. This helps developers debug coredumps by correctly mapping source files. Background: - Scylla's Jenkins builds typically occur in /jenkins/workspace/scylla-${branch}/next - When debugging locally, source paths need remapping to match the build environment - The substitute-path rule allows GDB to locate source files correctly This change improves developer experience by providing the appropriate path substitution command directly in the prompt. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23038	2025-05-12 15:37:59 +03:00
Kefu Chai	46f7ff6cfc	docs: nodetool: reference "nodetool task" page * Rewrite the documentation for the "nodetool restore" command. * Clarify the relationship between the `--nowait` flag and asynchronous operation. * Reference the "nodetool task" page for managing background tasks. Fixes scylladb#21888 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22023	2025-05-12 15:37:22 +03:00
Botond Dénes	dff7e2fc2f	Merge 'gossiper: failure_detector_loop_for_node: abort send_gossip_echo using abort_source' from Benny Halevy Currently send_gossip_echo has a 22 seconds timeout during which _abort_source is ignored. Use a function-local abort_source to abort send_gossip_echo either on timeout or if _abort_source requested abort, and co_return in the latter case. Closes scylladb/scylladb#12296 * github.com:scylladb/scylladb: gossiper: make send_gossip_echo cancellable gossiper: add send_echo helper idl, message: make with_timeout and cancellable verb attributes composable gossiper: failure_detector_loop_for_node: ignore abort_requested_exception gossiper: failure_detector_loop_for_node: check if abort_requested in loop condition	2025-05-12 15:35:30 +03:00
Pavel Emelyanov	5bd3df507e	sstables: Lazily access statistics for trace-level logging There's a message in sstable::get_gc_before_for_fully_expire() method that is trace-level and one of its arguments finds a value in sstable statistics. Finding the value is not quite cheap (makes a lookup in std::unordered_map) and for mostly-off trace messages is just a waste of cycles. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23910	2025-05-12 11:22:31 +03:00
Patryk Jędrzejczak	4d0538eecb	Merge 'test/cluster: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek In this PR, we're adjusting most of the cluster tests so that they pass with the `rf_rack_valid_keyspaces` configuration option enabled. In most cases, the changes are straightforward and require little to no additional insight into what the tests are doing or verifying. In some, however, doing that does require a deeper understanding of the tests we're modifying. The justification for those changes and their correctness is included in the commit messages corresponding to them. Note that this PR does not cover all of the cluster tests. There are few remaining ones, but they require a bit more effort, so we delegate that work to a separate PR. I tested all of the modified tests locally with `rf_rack_valid_keyspaces` set to true, and they all passed. Fixes scylladb/scylladb#23959 Backport: we want to backport these changes to 2025.1 since that's the version where we introduced RF-rack-valid keyspaces in. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future and we'll also want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint. 
Closes scylladb/scylladb#23661 * https://github.com/scylladb/scylladb: test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite test/cluster: Disable rf_rack_valid_keyspaces in problematic tests test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity test/cluster/test_multidc.py: Adjust to RF-rack-validity test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity test/cluster: Adjust simple tests to RF-rack-validity	2025-05-12 09:41:07 +02:00
Aleksandra Martyniuk	2dcea5a27d	streaming: use host_id in file streaming Use host ids instead of ips in file-streaming. Fixes: #22421. Closes scylladb/scylladb#24055	2025-05-12 09:36:48 +03:00
Łukasz Paszkowski	113647550f	tools/scylla-nodetool: fix crash when rows_merged cells contain null Any empty object of the json::json_list type has its internal _set variable assigned to false which results in such objects being skipped by the json::json_builder. Hence, the json returned by the api GET//compaction_manager/compaction_history does not contain the field `rows_merged` if a cell in the system.compaction_history table is null or an empty list. In such cases, executing the command `nodetool compactionhistory` will result in a crash with the following error message: `error running operation: rjson::error (JSON assert failed on condition 'false'` The patch fixes it by checking if the json object contains the `rows_merged` element before processing. If the element does not exist, the nodetool will now produce an empty list. Fixes https://github.com/scylladb/scylladb/issues/23540 Closes scylladb/scylladb#23514	2025-05-12 09:00:48 +03:00
Avi Kivity	5e764d1de2	Merge 'Drop v2 and flat from reader and related names' from Botond Dénes Following a number of similar code cleanup PRs, this one aims to be the last one, definitely dropping flat from all reader and related names. Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant). The changes in this PR are entirely mechanical, mostly just search-and-replace. Code cleanup, no backport required. Closes scylladb/scylladb#24087 * github.com:scylladb/scylladb: test/boost/mutation_reader_another_test: drop v2 from reader and related names test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/ test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/ test/boost/mutation_test: s/consumer_v2/consumer/ test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ readers/mutation_readers: s/generating_reader_v2/generating_reader/ readers/mutation_readers: s/delegating_reader_v2/delegating_reader/ readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/ readers/mutation_source: s/make_reader_v2/make_mutation_reader/ readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/ readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ mutation/mutation_compactor: drop v2 from compactor and related names replica/table: s/make_reader_v2/make_mutation_reader/ mutation_writer: s/bucket_writer_v2/bucket_writer/ readers/queue: drop v2 from reader and related names readers/multishard: drop v2 from reader and related names readers/evictable: drop v2 from reader and related names readers/multi_range: remove flat from name	2025-05-11 22:22:35 +03:00
Botond Dénes	3ba5dd79e6	tools/scylla-nodetool: document exit codes in --help Closes scylladb/scylladb#24054	2025-05-11 22:18:29 +03:00
Dawid Mędrek	ee96f8dcfc	test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite Almost all of the tests have been adjusted to be able to be run with the `rf_rack_valid_keyspaces` configuration option enabled, while the rest, a minority, create nodes with it disabled. Thanks to that, we can enable it by default, so let's do that.	2025-05-10 16:30:51 +02:00
Dawid Mędrek	c4b32c38a3	test/cluster: Disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the test suite have proven to be more problematic in adjusting to RF-rack-validity. Since we'd like to run as many tests as possible with the `rf_rack_valid_keyspaces` configuration option enabled, let's disable it in those. In the following commit, we'll enable it by default.	2025-05-10 16:30:49 +02:00
Dawid Mędrek	c8c28dae92	test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity Three tests in the file use a multi-DC cluster. Unfortunately, they put all of the nodes in a DC in the same rack and because of that, they fail when run with the `rf_rack_valid_keyspaces` configuration option enabled. Since the tests revolve mostly around zero-token nodes and how they affect replication in a keyspace, this change should have zero impact on them.	2025-05-10 16:30:46 +02:00
Dawid Mędrek	04567c28a3	test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity We reduce the number of nodes and the RF values used in the test to make sure that the test can be run with the `rf_rack_valid_keyspaces` configuration option. The test doesn't seem to be reliant on the exact number of nodes, so the reduction should not make any difference.	2025-05-10 16:30:43 +02:00
Dawid Mędrek	d3c0cd6d9d	test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity The change boils down to matching the number of created racks to the number of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`. This way, we ensure that the created keyspace will be RF-rack-valid and so we can run the test file even with the `rf_rack_valid_keyspaces` configuration option enabled. The change has no impact on the tests that use the function; the distribution of nodes across racks does not affect how repair is performed or what the tests do and verify. Because of that, the change is correct.	2025-05-10 16:30:40 +02:00
Dawid Mędrek	5d1bb8ebc5	test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair We assign the newly created nodes to multiple racks. If RF <= 3, we create as many racks as the provided RF. We disallow the case of RF > 3 to avoid trying to create an RF-rack-invalid keyspace; note that no existing test calls `create_table_insert_data_for_repair` providing a higher RF. The rationale for doing this is we want to ensure that the tests calling the function can be run with the `rf_rack_valid_keyspaces` configuration option enabled.	2025-05-10 16:30:37 +02:00
Dawid Mędrek	92f7d5bf10	test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity We assign the nodes to the same DC, but multiple racks to ensure that the created keyspace is RF-rack-valid and we can run the test with the `rf_rack_valid_keyspaces` configuration option enabled. The changes do not affect what the test does and verifies.	2025-05-10 16:30:34 +02:00
Dawid Mędrek	4c46551c6b	test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity We simply assign the nodes used in the test to separate racks to ensure that the created keyspace is RF-rack-valid to be able to run the test with the `rf_rack_valid_keyspaces` configuration option set to true. The change does not affect what the test does and verifies -- it only depends on the type of nodes, whether they are normal token owners or not -- and so the changes are correct in that sense.	2025-05-10 16:30:31 +02:00
Dawid Mędrek	2882b7e48a	test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity We parameterize the test so it's run with and without enforced RF-rack-valid keyspaces. In the test itself, we introduce a branch to make sure that we won't run into a situation where we're attempting to create an RF-rack-invalid keyspace. Since the `rf_rack_valid_keyspaces` option is not commonly used yet and because its semantics will most likely change in the future, we decide to parameterize the test rather than try to get rid of some of the test cases that are problematic with the option enabled.	2025-05-10 16:30:29 +02:00
Dawid Mędrek	73b22d4f6b	test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity We simply assign DC/rack properties to every node used in the test. We put all of them in the same DC to make sure that the cluster behaves as closely to how it would before these changes. However, we distribute them over multiple racks to ensure that the keyspace used in the test is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces` configuration option set to true. The distribution of nodes between racks has no effect on what the test does and verifies, so the changes are correct in that sense.	2025-05-10 16:30:26 +02:00
Dawid Mędrek	5b83304b38	test/cluster/test_multidc.py: Adjust to RF-rack-validity Instead of putting all of the nodes in a DC in the same rack in `test_putget_2dc_with_rf`, we assign them to different racks. The distribution of nodes in racks is orthogonal to what the test is doing and verifying, so the change is correct in that sense. At the same time, it ensures that the test never violates the invariant of RF-rack-valid keyspaces, so we can also run it with `rf_rack_valid_keyspaces` set to true.	2025-05-10 16:30:23 +02:00
Dawid Mędrek	9281bff0e3	test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity We modify the parameters of `test_restore_with_streaming_scopes` so that it now represents a pair of values: topology layout and the value `rf_rack_valid_keyspaces` should be set to. Two of the already existing parameters violate RF-rack-validity and so the test would fail when run with `rf_rack_valid_keyspaces: true`. However, since the option isn't commonly used yet and since the semantics of RF-rack-valid keyspaces will most likely change in the future, let's keep those cases and just run them with the option disabled. This way, we still test everything we can without running into undesired failures that don't indicate anything.	2025-05-10 16:30:20 +02:00
Dawid Mędrek	dbb8835fdf	test/cluster: Adjust simple tests to RF-rack-validity We adjust all of the simple cases of cluster tests so they work with `rf_rack_valid_keyspaces: true`. It boils down to assigning nodes to multiple racks. For most of the changes, we do that by: * Using `pytest.mark.prepare_3_racks_cluster` instead of `pytest.mark.prepare_3_nodes_cluster`. * Using an additional argument -- `auto_rack_dc` -- when calling `ManagerClient::servers_add()`. In some cases, we need to assign the racks manually, which may be less obvious, but in every such situation, the tests didn't rely on that assignment, so that doesn't affect them or what they verify.	2025-05-10 16:30:18 +02:00
Botond Dénes	911aa64043	test/boost/mutation_reader_another_test: drop v2 from reader and related names For the test case test_mutation_reader_from_mutations_as_mutation_source, the v1/v2 distinction was hiding two identical test cases. One was removed.	2025-05-09 07:53:30 -04:00
Botond Dénes	466a8a2b64	test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	30625a6ef7	test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	1169ac6ac8	test/boost/mutation_test: s/consumer_v2/consumer/	2025-05-09 07:53:30 -04:00
Botond Dénes	17b667b116	test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/	2025-05-09 07:53:30 -04:00
Botond Dénes	5dd546ea2b	readers/mutation_readers: s/generating_reader_v2/generating_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	75fddbc078	readers/mutation_readers: s/delegating_reader_v2/delegating_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	2fc3e52b2b	readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	674d41e3e6	readers/mutation_source: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	327867aa8a	readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/	2025-05-09 07:53:29 -04:00
Botond Dénes	efc48caea5	readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/	2025-05-09 07:53:29 -04:00
Botond Dénes	7af0690762	mutation/mutation_compactor: drop v2 from compactor and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	b5170e27d0	replica/table: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	cc95dc8756	mutation_writer: s/bucket_writer_v2/bucket_writer/	2025-05-09 07:53:29 -04:00
Botond Dénes	3d2651e07c	readers/queue: drop v2 from reader and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	ca7f557e86	readers/multishard: drop v2 from reader and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	4d92bc8b2f	readers/evictable: drop v2 from reader and related names	2025-05-09 07:53:28 -04:00
Botond Dénes	7ba3c3fec3	readers/multi_range: remove flat from name	2025-05-09 07:53:25 -04:00
Avi Kivity	092a88c9b9	dist: drop the scylla-env package scylla-env was used to glue together support for older distributions. It hasn't been used for many years. Remove it. Closes scylladb/scylladb#23985	2025-05-09 14:10:00 +03:00
Raphael S. Carvalho	28056344ba	replica: Fix take_storage_snapshot() running concurrently to merge completion Some background: When merge happens, a background fiber wakes up to merge compaction groups of sibling tablets into main one. It cannot happen when rebuilding the storage group list, since token metadata update is not preemptable. So a storage group, post merge, has the main compaction group and two other groups to be merged into the main. When the merge happens, those two groups are empty and will be freed. Consider this scenario: 1) merge happens, from 2 to 1 tablet 2) produces a single storage group, containing main and two other compaction groups to be merged into main. 3) take_storage_snapshot(), triggered by migration post merge, gets a list of pointers to all compaction groups. 4) t__s__s() iterates first on main group, yields. 5) background fiber wakes up, moves the data into main and frees the two groups 6) t__s__s() advances to other groups that are now freed, since step 5. 7) segmentation fault In addition to memory corruption, there's also a potential for data to escape the iteration in take_storage_snapshot(), since data can be moved across compaction groups in background, all belonging to the same storage group. That could result in data loss. Readers should all operate on storage group level since it can provide a view on all the data owned by a tablet replica. The movement of sstable from group A to B is atomic, but iteration first on A, then later on B, might miss data that was moved from B to A, before the iteration reached B. By switching to storage group in the interface that retrieves groups by token range, we guarantee that all data of a given replica can be found regardless of which compaction group they sit on. Fixes #23162. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24058	2025-05-09 14:07:06 +03:00
Gleb Natapov	c6e1758457	topology coordinator: make decommissioning node non voter before completing the operation A decommissioned node is removed from a raft config after operation is marked as completed. This is required since otherwise the decommissioned node will not see that decommission has completed (the status is propagated through raft). But right after the decommission is marked as completed a decommissioned node may terminate, so in case of a two node cluster, the configuration change that removes it from the raft will fail, because there will be no quorum. The solution is to mark the decommissioning node as non voter before reporting the operation as completed. Fixes: #24026 Backport to 2025.2 because it fixes a potential hang. Don't backport to branches older than 2025.2 because they don't have `8b186ab0ff`, which caused this issue. Closes scylladb/scylladb#24027	2025-05-09 12:43:31 +02:00
Tomasz Grabiec	be2c3ad6fd	Merge 'logalloc_test: don't test performance in test background_reclaim' from Michał Chojnowski The test is failing in CI sometimes due to performance reasons. There are at least two problems: 1. The initial 500ms (wall time) sleep might be too short. If the reclaimer doesn't manage to evict enough memory during this time, the test will fail. 2. During the 100ms (thread CPU time) window given by the test to background reclaim, the `background_reclaim` scheduling group isn't actually guaranteed to get any CPU, regardless of shares. If the process is switched out inside the `background_reclaim` group, it might accumulate so much vruntime that it won't get any more CPU again for a long time. We have seen both. This kind of timing test can't be run reliably on overcommitted machines without modifying the Seastar scheduler to support that (by e.g. using thread clock instead of wall time clock in the scheduler), and that would require an amount of effort disproportionate to the value of the test. So for now, to unflake the test, this patch removes the performance test part. (And the tradeoff is a weakening of the test). After the patch, we only check that the background reclaim happens eventually. Fixes https://github.com/scylladb/scylladb/issues/15677 Backporting this is optional. The test is flaky even in stable branches, but the failure is rare. Closes scylladb/scylladb#24030 * github.com:scylladb/scylladb: logalloc_test: don't test performance in test `background_reclaim` logalloc: make background_reclaimer::free_memory_threshold publicly visible	2025-05-09 11:35:02 +02:00
Patryk Jędrzejczak	be4532bcec	Merge 'Correctly skip updating node's own ip address due to outdated gossiper data ' from Gleb Natapov Use host id to check if the update is for the node itself. Using IP is unreliable since if a node is restarted with different IP a gossiper message with previous IP can be misinterpreted as belonging to a different node. Fixes: #22777 Backport to 2025.1 since this fixes a crash. Older versions do not have the code. Closes scylladb/scylladb#24000 * https://github.com/scylladb/scylladb: test: add reproducer for #22777 storage_service: Do not remove gossiper entry on address change storage_service: use id to check for local node	2025-05-09 11:28:21 +02:00
Andrzej Jackowski	f53d733e89	docs: lwt: add two missing spaces Due to lack of spaces, two example queries were not displayed in the rendered version of the document. As a result, the `SELECT * FROM movies.nowshowing;` query in step 6 returned 6 rows instead of the expected 8 rows.	2025-05-09 08:42:15 +02:00
Piotr Smaron	f740f9f0e1	cql: fix CREATE tablets KS warning msg Materialized Views and Secondary Indexes are two more features that keyspaces with tablets do not support, but these were not listed in the warning message returned to the user on a CREATE KEYSPACE statement. This commit adds the 2 missing features. Fixes: #24006 Closes scylladb/scylladb#23902	2025-05-08 17:18:43 +02:00
Tomasz Grabiec	fadfbe8459	Merge 'transport: storage_proxy: release ERM when waiting for query timeout' from Andrzej Jackowski Before this change, if a read executor had just enough targets to achieve query's CL, and there was a connection drop (e.g. node failure), the read executor waited for the entire request timeout to give drivers time to execute a speculative read in the meantime. Such behavior doesn't work well when a very long query timeout (e.g. 1800s) is set, because the unfinished request blocks topology changes. This change implements a mechanism to throw a new read_failure_exception_with_timeout in the aforementioned scenario. The exception is caught by CQL server which conducts the waiting, after ERM is released. The new exception inherits from read_failure_exception, because layers that don't catch the exception (such as mapreduce service) should handle the exception just as a regular read_failure. However, when CQL server catches the exception, it returns read_timeout_exception to the client because after additional waiting such an error message is more appropriate (read_timeout_exception was also returned before this change was introduced). This change: - Rewrite cql_server::connection::process_request_one to use seastar::futurize_invoke and try_catch<> instead of utils::result_try - Add new read_failure_exception_with_timeout and throw it in storage_proxy - Add sleep in CQL server when the new exception is caught - Catch local exceptions in Mapreduce Service and convert them to std::runtime_error. - Add get_cql_exclusive to manager_client.py - Add test_long_query_timeout_erm No backport needed - minor issue fix. 
Closes scylladb/scylladb#23156 * github.com:scylladb/scylladb: test: add test_long_query_timeout_erm test: add get_cql_exclusive to manager_client.py mapreduce: catch local read_failure_exception_with_timeout transport: storage_proxy: release ERM when waiting for query timeout transport: remove redundant references in process_request_one transport: fix the indentation in process_request_one transport: add futures in CQL server exception handling	2025-05-08 12:45:49 +02:00
Avi Kivity	2d2a2ef277	tools: toolchain: dbuild: support nested containers Pass through the local containers directory (it cannot be bind-mounted to /var/lib/containers since podman checks the path hasn't changed) with overrides to the paths. This allows containers to be created inside the dbuild container, so we can enlist pre-packaged software (such as opensearch) in test.py. If the container images are already downloaded in the host, they won't be downloaded again. It turns out that the container ecosystem doesn't support nested network namespaces well, so we configure the outer container to use host networking for the inner containers. It's useful anyway. The frozen toolchain now installs podman and buildah so there's something to actually drive those nested containers. We disable weak dnf dependencies to avoid installing qemu. The frozen toolchain is regenerated with optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#24020	2025-05-08 13:00:16 +03:00
Botond Dénes	4a802baccb	Merge 'compress: make sstable compression dictionaries NUMA-aware ' from Michał Chojnowski compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards. New functionality, added to a feature which isn't in any stable branch yet. No backporting. 
Closes scylladb/scylladb#23590 * github.com:scylladb/scylladb: test: add test/boost/sstable_compressor_factory_test compress: add some test-only APIs compress: rename sstable_compressor_factory_impl to dictionary_holder compress: fix indentation compress: remove sstable_compressor_factory_impl::_owner_shard compress: distribute compression dictionaries over shards test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version test: remove sstables::test_env::do_with()	2025-05-08 09:52:46 +03:00
Botond Dénes	e5d944f986	Merge 'replica: Fix use-after-free with concurrent schema change and sstable set update' from Raphael Raph Carvalho When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set. Fixes #22040. Closes scylladb/scylladb#23680 * github.com:scylladb/scylladb: replica: Fix use-after-free with concurrent schema change and sstable set update sstables: Implement sstable_set_impl::all_sstable_runs()	2025-05-08 06:56:16 +03:00
Petr Gusev	e6c3f954f6	main: check if current process group controls stdin tty test.py doesn't override stdin when starting Scylla, so when tests are run from a terminal, isatty() returns true and parsed command line output is not printed, which is inconvenient. In this commit we add a check if the current process group controls the stdin terminal. This serves two purposes: * improves the "interactive mode" check from #scylladb/scylladb#18309, as only the controlling process group can interact with the terminal. * solves the test.py problem above, because test.py runs scylla in a new session/process group (it calls setsid after fork), and is now correctly not considered interactive. Closes scylladb/scylladb#24047	2025-05-08 06:52:48 +03:00
Michał Chojnowski	746ec1d4e4	test/boost/mvcc_test: fix an overly-strong assertion in test_snapshot_cursor_is_consistent_with_merging The test checks that merging the partition versions on-the-fly using the cursor gives the same results as merging them destructively with apply_monotonically. In particular, it tests that the continuity of both results is equal. However, there's a subtlety which makes this not true. The cursor puts empty dummy rows (i.e. dummies shadowed by the partition tombstone) in the output. But the destructive merge is allowed (as an exception to the general rule, for optimization reasons), to remove those dummies and thus reduce the continuity. So after this patch we instead check that the output of the cursor has continuity equal to the merged continuities of versions. (Rather than to the continuity of merged versions, which can be smaller as described above). Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did the same in a different test. Fixes https://github.com/scylladb/scylladb/issues/13642 Closes scylladb/scylladb#24044	2025-05-08 00:41:01 +02:00
Pavel Emelyanov	0a9675de01	sstable: Use fmt::to_string(sstable::filename()) to get component file path The stream sink abort() method wants to remove component file by its path. For that the path is calculated from storage prefix and component basename, but there's a filename() method for it already. SStable filenames shouldn't be considered as on-disk paths (see #23194), but places that want it should be explicit and format the filename to string by hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24039	2025-05-07 22:25:58 +03:00
Pavel Emelyanov	36baeaeb57	sstable: Move update_info_for_opened_data() method to private: block The method is internally called by sstable itself to refresh its state after opening or assigning (from foreign info) data and index files. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24041	2025-05-07 20:58:34 +03:00
Pavel Emelyanov	c2ecc45db8	sstable: Remove validate argument from sstable::load_metadata() There are only two callers of the method and the one that wants validation (the sstable::load()) can do it on its own. This helps the other caller (schema loader) being simpler and shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24038	2025-05-07 20:57:37 +03:00
Michał Chojnowski	f075674ebe	test: add test/boost/sstable_compressor_factory_test Add a basic test for NUMA awareness of `default_sstable_compressor_factory`.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	518f04f1c4	compress: add some test-only APIs Will be needed by the test added in the next patch.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	66a454f61d	compress: rename sstable_compressor_factory_impl to dictionary_holder Since sstable_compressor_factory_impl no longer implements sstable_compressor_factory, the name can be misleading. Rename it to something closer to its new role.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	e952992560	compress: fix indentation Purely cosmetic.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	6b831aaf1b	compress: remove sstable_compressor_factory_impl::_owner_shard Before the series, sstable_compressor_factory_impl was directly accessed by multiple shards. Now, it's a part of a `sharded` data structure and is never directly accessed from other shards, so there's no need to check for that. Remove the leftover logic.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	1bcf77951c	compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards.	2025-05-07 14:43:18 +02:00
Michał Chojnowski	8649adafa8	test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version In next patches, make_sstable_compressor_factory() will have to disappear. In preparation for that, we switch to a seastar::thread-dependent replacement.	2025-05-07 14:43:04 +02:00
Aleksandra Martyniuk	2549f5e16b	test_tablet_repair_hosts_filter: change injected error test_tablet_repair_hosts_filter checks whether the host filter specified for tablet repair is correctly persisted. To check this, we need to ensure that the repair is still ongoing and its data is kept. The test achieves that by failing the repair on replica side - as the failed repair is going to be retried. However, if the filter does not contain any host (included_host_count = 0), the repair is started on no replica, so the request succeeds and its data is deleted. The test fails if it checks the filter after repair request data is removed. Fail repair on topology coordinator side, so the request is ongoing regardless of the specified hosts. Fixes: #23986. Closes scylladb/scylladb#24003	2025-05-07 15:30:05 +03:00
Michał Chojnowski	0e4d0ded8d	test: remove sstables::test_env::do_with() `sstable_manager` depends on `sstable_compressor_factory&`. Currently, `test_env` obtains an implementation of this interface with the synchronous `make_sstable_compressor_factory()`. But after this patch, the only implementation of that interface `sstable_compressor_factory&` will use `sharded<...>`, so its construction will become asynchronous, and the synchronous `make_sstable_compressor_factory()` must disappear. There are several possible ways to deal with this, but I think the easiest one is to write an asynchronous replacement for `make_sstable_compressor_factory()` that will keep the same signature but will be only usable in a `seastar::thread`. All other uses of `make_sstable_compressor_factory()` outside of `test_env::do_with()` already are in seastar threads, so if we just get rid of `test_env::do_with()`, then we will be able to use that thread-dependent replacement. This is the purpose of this commit. We shouldn't be losing much.	2025-05-07 13:19:21 +02:00
Nadav Har'El	7ccf77b84f	test/alternator: another test for UpdateExpression's SET I found on StackOverflow an interesting discussion about the fact that DynamoDB's UpdateExpression documentation "recommends" to use SET instead of ADD, and the rather convoluted expression that is actually needed to emulate ADD using SET: ``` SET #count = if_not_exists(#count, :zero) + :one ``` https://stackoverflow.com/questions/14077414/dynamodb-increment-a-key-value Although we do have separate tests for the different pieces of that idiom - a SET with missing attribute or item, the if_not_exists() function, etc. - I thought it would be nice to have a dedicated test that verifies that this idiom actually works, and moreover that the more naive "SET #count = #count + :one" does NOT work if the item or the attribute are missing. Unsurprisingly, the new test passes on both Alternator and DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23963	2025-05-07 13:57:50 +03:00
Nadav Har'El	b4a9fe9928	test/alternator: another test for expression with a lot of ORs We already have a test, test_limits.py::test_deeply_nested_expression_2, which checks that in the long condition expression a<b or (a<b or (a<b or (a<b or (....)))) with more than MAX_DEPTH (=400) repeats is rejected by Alternator, as part of commit `04e5082d52` which restricted the depth of the recursive parser to prevent crashing Scylla. However, I got curious what will happen without the parentheses: a<b or a<b or a<b or a<b or ... It turns out that our parser actually parses this syntax without recursion - it's just a loop (a "*" in the Antlr alternator/expressions.g allows reading more and more ORs in a loop). So Alternator doesn't limit the length of this expression more than the length limit of 4096 bytes which we also have. We can fit 584 repeats in the above expression in 4096 bytes, and it will not be rejected even though 584 > 400. This test confirms that this is indeed the case. The test is Scylla-only because on DynamoDB, this expression is rejected because it has more than 300 "OR" operators. Scylla doesn't have this specific limit - we believe the other limitations (on total expression length, and on depth) are better for protecting Scylla. Remember that in an expression like "(((((((((((((" there is a very high recursion depth of the parser but zero operators, so counting the operators does nothing to protect Scylla. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23973	2025-05-07 13:57:18 +03:00
Piotr Dulikowski	156ff8798b	topology_coordinator: silence ERROR messages on abort When the topology coordinator is shut down while doing a long-running operation, the current operation might throw a raft::request_aborted exception. This is not a critical issue and should not be logged with ERROR verbosity level. Make sure that all the try..catch blocks in the topology coordinator which: - May try to acquire a new group0 guard in the `try` part - Have a `catch (...)` block that print an ERROR-level message ...have a pass-through `catch (raft::request_aborted&)` block which does not log the exception. Fixes: scylladb/scylladb#22649 Closes scylladb/scylladb#23962	2025-05-07 13:51:41 +03:00
Aleksandra Martyniuk	20c2d6210e	streaming: skip dropped tables Currently, stream_session::prepare throws when a table in requests or summaries is dropped. However, we do not want to fail streaming if the table is dropped. Delete table checks from stream_session::prepare. Further streaming steps can handle the dropped table and finish the streaming successfully. Fixes: #15257. Closes scylladb/scylladb#23915	2025-05-07 11:51:56 +03:00
Anna Mikhlin	73b4c35601	Update ScyllaDB version to: 2025.3.0-dev	2025-05-07 11:43:11 +03:00
Pavel Emelyanov	6389099dfb	Merge 'test/cluster/test_read_repair.py: improve trace logging test (again)' from Botond Dénes The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it is supposed to test, use tracing to assert that read repair did indeed happen. Fixes: scylladb/scylladb#23968 Needs backport to 2025.1 and 6.2, both have the flaky test Closes scylladb/scylladb#23989 * github.com:scylladb/scylladb: test/cluster/test_read_repair.py: improve trace logging test (again) test/cluster: extract execute_with_tracing() into pylib/util.py	2025-05-07 10:32:45 +03:00
Botond Dénes	0a9ca52cfd	replica/database: memtable_list: save ref to memtable_table_shared_data This is passed by reference to the constructor, but a copy is saved into the _table_shared_data member. A reference to this member is passed down to all memtable readers. Because of the copy, the memtable readers save a reference to the memtable_list's member, which goes away together with the memtable_list when the storage_group is destroyed. This causes use-after-free when a storage group is destroyed while a memtable read is still ongoing. The memtable reader keeps the memtable alive, but its reference to the memtable_table_shared_data becomes stale. Fix by saving a reference in the memtable_list too, so memtable readers receive a reference pointing to the original replica::table member, which is stable across tablet migrations and merges. The copy was introduced by `2a76065e3d`. There was a copy even before this commit, but in the previous vnode-only world this was fine -- there was one memtable_list per table and it was around until the table itself was. In the tablet world, this is no longer given, but the above commit didn't account for this. A test is included, which reproduces the use-after-free on memtable migration. The test is somewhat artificial in that the use-after-free would be prevented by holding on to an ERM, but this is done intentionally to keep the test simple. Migration -- unlike merge where this use-after-free was originally observed -- is easy to trigger from unit tests. Fixes: #23762 Closes scylladb/scylladb#23984	2025-05-06 22:13:17 +03:00
Michał Chojnowski	1c1741cfbc	logalloc_test: don't test performance in test `background_reclaim` The test is failing in CI sometimes due to performance reasons. There are at least two problems: 1. The initial 500ms (wall time) sleep might be too short. If the reclaimer doesn't manage to evict enough memory during this time, the test will fail. 2. During the 100ms (thread CPU time) window given by the test to background reclaim, the `background_reclaim` scheduling group isn't actually guaranteed to get any CPU, regardless of shares. If the process is switched out inside the `background_reclaim` group, it might accumulate so much vruntime that it won't get any more CPU again for a long time. We have seen both. This kind of timing test can't be run reliably on overcommitted machines without modifying the Seastar scheduler to support that (by e.g. using thread clock instead of wall time clock in the scheduler), and that would require an amount of effort disproportionate to the value of the test. So for now, to unflake the test, this patch removes the performance test part. (And the tradeoff is a weakening of the test).	2025-05-06 18:59:18 +02:00
Michał Chojnowski	c47f438db3	logalloc: make background_reclaimer::free_memory_threshold publicly visible Wanted by the change to the background_reclaim test in the next patch.	2025-05-06 18:59:18 +02:00
David Garcia	b1ee0e2a6a	docs: fix AttributeError with 'myst_enable_extensions' in publication workflow Rolled back some dependencies in `poetry.lock` to previous versions while we investigate how to make the extension `sphinx_scylladb_markdown` compatible with the latest versions. This should fix the error in https://github.com/scylladb/scylladb/actions/runs/14708656912/job/41275115239, which currently prevents publishing new versions of https://opensource.docs.scylladb.com/ Closes scylladb/scylladb#23969	2025-05-06 16:33:00 +03:00
Pavel Emelyanov	1b5bbc2433	Merge 'test.py: split boost pytest integration' from Andrei Chekun This PR contains changes that do not add new functionality, and have small refactoring of the existing code. The most significant change is the refactoring of resource gathering, so it will not create another cgroup to put itself in. So there will be no nested redundant 'initial' groups, e.g. `/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/initial/initial/initial.../initial` This is part two of splitting the original PR. This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278. Closes scylladb/scylladb#23882 * github.com:scylladb/scylladb: test.py: add awareness of extra_scylla_cmdline_options test.py: increase timeout for C++ tests in pytest test.py: switch method of finding the root repo directory test.py: move get_combined_tests to the correct facade test.py: add common directory for reports test.py: add the possibility to provide additional env vars test.py: move setup cgroups to the generic method test.py: refactor resource_gather.py	2025-05-06 16:22:49 +03:00
Raphael S. Carvalho	434c2c4649	replica: Fix use-after-free with concurrent schema change and sstable set update When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set, since patch "sstables: Implement sstable_set_impl::all_sstable_runs()". Fixes #22040. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:55 -03:00
Raphael S. Carvalho	628bec4dbd	sstables: Implement sstable_set_impl::all_sstable_runs() With upcoming change where table::set_compaction_strategy() might delay update of sstable set, ICS might temporarily work with sstable set implementations other than partitioned_sstable_set. ICS relies on all_sstable_runs() during regular compaction, and today it triggers bad_function_call exception if not overridden by set implementation. To remove this strong dependency between compaction strategy and a particular set implementation, let's provide a default implementation of all_sstable_runs(), such that ICS will still work until the set is updated eventually through a process that adds or removes an sstable. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:06 -03:00
Botond Dénes	3c3f6ca233	tools/scylla-sstable: scrub: use UUID sstable identifiers Much easier to avoid sstable collisions. Makes it possible to scrub multiple sstables, with multiple calls to scylla-sstable, reusing the same output directory. Previously, each new call to scylla-sstable scrub, would start from generation 0, guaranteeing collision. Remove the unit test for generation clash -- with UUID generations, this is no longer possible to reproduce in practice. Refs: #21387 Closes scylladb/scylladb#23990	2025-05-06 15:09:53 +03:00
Patryk Jędrzejczak	7f843e0a5c	Merge 'raft: make sure to retain the existing voters including the current leader (topology coordinator)' from Emil Maskovsky Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack. Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election. To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations. Fixes: scylladb/scylladb#23950 Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786 No backport: The limited voters feature is currently only present in master. 
Closes scylladb/scylladb#23888 * https://github.com/scylladb/scylladb: raft: ensure topology coordinator retains votership raft: retain existing voters across data centers and racks raft: refactor limited voters calculator to prioritize nodes raft: replace pointer with reference for non-null output parameter raft: reduce code duplication in group0 voter handler raft: unify and optimize datacenter and rack info creation	2025-05-06 13:49:55 +02:00
Nadav Har'El	252c5b5c9d	Merge 'Alternator batch_write_item wcu' from Amnon Heiman This series adds support for WCU tracking in batch_write_item and tests it. The patches include: Switch the metrics (RCU and WCU) to count units vs half-units as they were, to make the metrics clearer for users. Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object. Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested. The return handling was refactored to be coroutine-like for easier management of the consumed capacity array. Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly. Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently. Need backport, WCU is partially supported, and is missing from batch_write_item Fixes #23940 Closes scylladb/scylladb#23941 * github.com:scylladb/scylladb: alternator/test_metrics.py: batch_write validate WCU alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU alternator/executor: add WCU for batch_write_items alternator/consumed_capacity: make wcu get_units public Alternator: Change the WCU/RCU to use units	2025-05-06 13:31:53 +03:00
Gleb Natapov	7403de241c	test: add reproducer for #22777 Add sleep before starting gossiper to increase a chance of getting old gossiper entry about yourself before updating local gossiper info with new IP address.	2025-05-06 11:21:17 +03:00
Botond Dénes	29eedaa0e5	test/cluster/test_read_repair.py: improve trace logging test (again) The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it is supposed to test, use tracing to assert that read repair did indeed happen.	2025-05-06 01:35:17 -04:00
Avi Kivity	fc2204cea0	Merge ' test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits' from Botond Dénes This test has multiple problems: * has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead * initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail * duplicate check of drops == 0 (just cosmetic) Fix all three problems, the second is especially important because it made the test flaky. Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them. Fixes: #16794 Test fix, not a backport candidate normally, we can backport to 2025.1 if the test becomes too unstable there Closes scylladb/scylladb#23783 * github.com:scylladb/scylladb: test/boost/multishard_mutation_query_test: ensure test runs with vnodes test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits	2025-05-05 20:49:03 +03:00
Emil Maskovsky	24dfd2034b	raft: ensure topology coordinator retains votership The limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the current topology coordinator, triggering an unnecessary Raft leader re-election. This change ensures that the existing topology coordinator's votership status is preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the topology coordinator is prioritized for removal. This helps maintain stability in the cluster by avoiding unnecessary leader re-elections. Additionally, only the alive leader node is considered relevant for this logic. A dead existing leader (topology coordinator) is excluded from consideration, as it is already in the process of losing leadership. Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786	2025-05-05 16:58:34 +02:00
Emil Maskovsky	2ae59e8a87	raft: retain existing voters across data centers and racks Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing voters in each data center and rack. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. Fixes: scylladb/scylladb#23950	2025-05-05 16:51:48 +02:00
Emil Maskovsky	018fb63305	raft: refactor limited voters calculator to prioritize nodes Refactor the limited voters calculator to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations. The priority value is determined based on the node's existing status, including whether it is alive, a voter, or any further criteria.	2025-05-05 16:36:17 +02:00
Emil Maskovsky	26fdc7b8f8	raft: replace pointer with reference for non-null output parameter The output parameter cannot be `null`. Previously, a pointer was used to make it explicit that the parameter is an output parameter being modified. However, this is unnecessary, as references are more appropriate for parameters that cannot be `null`. Switching to a reference improves code readability and ensures the parameter's non-null constraint is enforced at the type level.	2025-05-05 16:12:00 +02:00
Emil Maskovsky	f0468860a3	raft: reduce code duplication in group0 voter handler Refactor the group0 voter handler by introducing a helper lambda to handle the common logic for adding a node. This eliminates unnecessary code duplication. This refactor does not introduce any functional changes but prepares the codebase for easier future modifications.	2025-05-05 16:09:53 +02:00
Botond Dénes	855411caad	test/boost/multishard_mutation_query_test: ensure test runs with vnodes All tests in this suite use the default "ks" keyspace from cql_test_env. This keyspace has tablet support and at any time we might decide to make it use tablets by default. This would make all these tests use the tablet path in multishard_mutation_query.cc. These tests were created to test the vastly more complex vnodes code path in said file. The tablet path is much simpler and it is only used by SELECT * FROM MUTATION_FRAGMENTS(), which has its own correctness tests. So explicitly create a vnodes keyspace and use it in all the tests to restore the test functionality.	2025-05-05 09:22:54 -04:00
Botond Dénes	1175e1ed49	test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits This test has multiple problems: * has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead * initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail * duplicate check of drops == 0 (just cosmetic) Fix all three problems, the second is especially important because it made the test flaky.	2025-05-05 09:22:53 -04:00
Emil Maskovsky	2ef654149f	raft: unify and optimize datacenter and rack info creation Refactor the code to use a consistent pattern for creating the datacenter info list and the rack info list. Both now use a map of vectors, which improves efficiency by reducing temporary conversions to maps/sets during node list processing. Also ensure the node descriptor is passed by reference instead of by copy, leveraging the guaranteed lifetime of the descriptors.	2025-05-05 15:15:17 +02:00
Pavel Emelyanov	cf1ffd6086	Merge 'sstables_loader: fix the racing between get_progress() and release_resources()' from Kefu Chai This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them. Problem: - Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data) - `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard` - As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard` - If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory. Solution: - Implemented shared/exclusive locking to protect access to both state and sharded progress data - Multiple `get_progress()` calls can execute in parallel (shared access) - `release_resources()` acquires exclusive access before modifying resources - This prevents potential memory corruption and ensures consistent progress reporting Fixes #23801 --- this change addresses a racing related to tracking the restore progress from S3 using scylla's native API, which is not used in production yet, hence no need to backport. Closes scylladb/scylladb#23808 * github.com:scylladb/scylladb: sstables_loader: fix the indent sstables_loader: fix the racing between get_progress() and release_resources()	2025-05-05 15:45:15 +03:00
Avi Kivity	e688e89430	tools: toolchain: clear .cache and .cargo directories The .cache and .cargo directories are used during pip and rust builds when preparing the toolchain, but aren't useful afterwards. Remove them to save a bit of space. Closes scylladb/scylladb#23955	2025-05-05 14:43:14 +03:00
Avi Kivity	4c1f4c419c	tools: toolchain: dbuild: run as root in container under podman Running as root enables nested containers under podman without trouble from uid remapping. Unlike docker, under podman uid 0 in the container is remapped to the host uid for bind mounts, so writes to the build directory do not end up owned by root on the host. Nested containers will allow us to consume opensearch, cassandra-stress, and minio as containers rather than embedding them into the frozen toolchain. Closes scylladb/scylladb#23954	2025-05-05 14:40:43 +03:00
Amnon Heiman	2ab99d7a07	alternator/test_metrics.py: batch_write validate WCU This patch adds a test that verifies the WCU metrics are updated correctly during a batch_write_item operation. It ensures that the WCU of each item is calculated independently. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:20:24 +03:00
Amnon Heiman	14570f1bb5	alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU This patch adds two tests: A test that validates WCU calculation for batch put requests on a single table. A test that validates WCU calculation for batch requests across multiple tables, including ensuring that delete operations are counted as 1 WCU. Both tests verify that the consumed capacity is reported correctly according to the WCU rules. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:20:23 +03:00
Amnon Heiman	68db77643f	alternator/executor: add WCU for batch_write_items This patch adds consumed capacity unit support to batch_write_item. It calculates the WCU based on an item's length (for put) or a static 1 WCU (for delete), for each item on each table. The WCU metrics are always updated. If the user requests consumed capacity, a vector of consumed capacity is returned with an entry for each of the tables. For code simplicity, the return part of batch_write_item was updated to be coroutine-like; this makes it easier to manage the life cycle of the returned consumed_capacity array. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:20:14 +03:00
Amnon Heiman	f2ade71f4f	alternator/consumed_capacity: make wcu get_units public This patch adds a public static get_units function to wcu_consumed_capacity_counter. It will be used by the batch write item implementation, which cannot use the wcu_consumed_capacity_counter directly. Signed-off-by: Amnon Heiman <amnon@scylladb.com> consume_capacity need merge	2025-05-05 13:19:04 +03:00
Amnon Heiman	5ae11746fa	Alternator: Change the WCU/RCU to use units This patch changes the RCU/WCU Alternator metrics to use whole units instead of half units. The change includes the following: Change the metrics documentation. Keep the RCU counter internally in half units, but return the actual (whole unit) value. Change the RCU name to be rcu_half_units_total to indicate that it counts half units. Change the WCU to count in whole units instead of half units. Update the tests accordingly. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:18:09 +03:00
Anna Stuchlik	851a433663	doc: add a link to the previous Enterprise documentation This commit adds a link to the docs for previous Enterprise versions at https://enterprise.docs.scylladb.com/ to the left menu. As we still support versions 2024.1 and 2024.2, we need to ensure easier access to those docs sets. Fixes https://github.com/scylladb/scylladb/issues/23870 Closes scylladb/scylladb#23945	2025-05-05 12:16:47 +03:00
Avi Kivity	04fb2c026d	config: decrease default large allocation warning threshold to 128k Back in 2017 (`5a2439e702`), we introduced a check for large allocations as they can stall the memory allocator. The warning threshold was set at 1 MB. Since then many fixes for large allocations went in and it is now time to reduce the threshold further. We reduce it here to 128 kB, the natural allocation size for the system. A quick run showed no warnings. Closes scylladb/scylladb#23975	2025-05-05 12:13:48 +03:00
Pavel Emelyanov	b56d6fbb84	Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span a wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so. Closes scylladb/scylladb#23806 * github.com:scylladb/scylladb: test: Verify partitioned set store split and unsplit correctly sstables: Fix quadratic space complexity in partitioned_sstable_set compaction: Wire table_state into make_sstable_set() compaction: Introduce token_range() to table_state dht: Add overlap_ratio() for token range	2025-05-05 11:28:38 +03:00
David Garcia	4ba7182515	docs: fix md redirections for multiversion support This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page. The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files. This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases. Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment. Closes scylladb/scylladb#23957	2025-05-05 10:39:39 +03:00
Pavel Emelyanov	7b786d9398	topology_coordinator: Use this->_feature_service directly This dependency is already there, topology coordinator doesn't need to use database reference to get to the features. Previous patch of the same kind: `b79137eaa4` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23777	2025-05-05 09:37:29 +02:00
Piotr Dulikowski	05c797795f	Merge 'Simplify test/sstable_assertions class API' from Pavel Emelyanov It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697), now it can be put on some more strict diet. Closes scylladb/scylladb#23815 * github.com:scylladb/scylladb: test: Remove sstable_assertions::get_stats_metadata() test: Add sstable_assertions::operator->()	2025-05-05 09:33:45 +02:00
Nadav Har'El	834107ae97	test/cqlpy,alternator: fix reporting of Scylla crash during test The cqlpy and alternator test frameworks use a single Scylla node started once for all tests to run on. In the distant past, we had a problem where if one test caused Scylla to crash, the result was a confusing report of hundreds of failed tests - all tests after the crash "failed" and it wasn't easy to find which test really caused the crash. Our old solution to this problem was to have an autouse fixture (called cql_test_connection or dynamodb_test_connection) which tested the connection at the end of each test, and if it detected Scylla has crashed - it used pytest.exit() to report the error and have pytest exit and therefore stop running any further tests (which would have led to all of them failing). This approach had two problems: 1. The pytest.exit() caused the entire cqlpy suite to report a failure, but not the individual test - the individual test might have failed as well, but that isn't guaranteed and in any case this test's output is missing the informative message that Scylla crashed during the test. This was fine when for each cqlpy failure we had two separate error logs in Jenkins - the specific failed function, and the failed file - but when we recently got rid of the duplication by removing the second one, we no longer see the "Scylla crashed" messages any more. 2. Exiting pytest will be the wrong thing to do if the same pytest run could run tests from different test suites. We don't do this today, but we plan to support this approach soon. This patch fixes both problems by replacing the pytest.exit() call by setting a "scylla_crashed" flag and using pytest.fail(). The pytest.fail() causes the current test - the one which caused Scylla to crash - to be reported as an "ERROR" and the "Scylla crashed" message will correctly appear in this test's log. The flag will cause all other tests in the same test suite to be skip()ed. 
But other tests in other directories, depending on different fixtures, might continue to run normally. Fixes #23287 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23307	2025-05-05 10:15:56 +03:00
Nadav Har'El	3ce7e250cc	alternator: fix schema "concurrent modification" errors In ScyllaDB, schema modification operations use "optimistic locking": A schema operation reads the current schema, decides what it wants to do and prepares changes to the schema, and then attempts to commit those changes - but only if the schema hasn't changed since the first read. If the schema has already been changed by some other node - we need to try again. In a loop. In Alternator, there are six operations that perform schema modification: CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and UpdateTimeToLive. All of them were missing this loop. We knew about this - and even had FIXME in all places. So any of these operations, when facing contention from concurrent schema modifications on different nodes, may fail with an error like: Internal server error: service::group0_concurrent_modification (Failed to apply group 0 change due to concurrent modification). This problem had very minor effect, if any, on real users because the DynamoDB SDK automatically retries operations that fail with retryable errors - like this "Internal server error" - and most likely the schema operation will succeed upon retry. However, as shown in issue #13152 these failures were annoying in our CI, where tests - which disable request retries - failed on these errors. This patch fixes all six operations (the last three operations all use one common function, db::modify_tags(), so are fixed by one change) to add the missing loop. The patch also includes reproducing tests for all these operations - the new tests all fail before this patch, and pass with it. These new tests are much more reliable reproducers than the dtests we had that only sometimes - very rarely - reproduced the problem. Moreover, the new tests reproduce the bug separately for each of the six operations, so if we forget to fix one of the six operations, one of the tests would have continued to fail. 
Of course I checked this during development. The new tests are in the test/cluster framework, not test/alternator, because this problem can only be reproduced in a multi-node cluster: On a single node, it serializes its schema modifications on its own; The collisions only happen when more than one node attempts schema modifications at the same time. Fixes #13152 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23827	2025-05-05 09:59:08 +03:00
Pavel Emelyanov	d40d6801b0	sstable_directory: Print ks.cf when moving unshared remote sstables When an sstable is identified by sstable_directory as remote-unshared, it will at some point be moved to the target shard. When it happens a log-message appears: sstable_directory - Moving 1 unshared SSTables to shard 1 Processing of tables by sstable_directory often happens in parallel, and messages from sstable_directory are intermixed. Having a message like above is not very informative, as it tells nothing about sstables that are being moved. Equip the message with ks:cf pair to make it more informative. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23912	2025-05-05 09:45:44 +03:00
Pavel Emelyanov	e0f30a30a7	sstable_directory: Print unshared remote sstable when sorting When collecting sstables, the sstable_directory may sort the collected descriptors into one of three buckets -- unshared local and remote, and shared ones. Unshared local and shared sstables' paths are logged (with trace level) while unshared remote is silently collected for further processing. Add log message for that case too, there's enough data to print the sstable path as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23913	2025-05-05 09:33:06 +03:00
Gleb Natapov	ecd14753c0	storage_service: Do not remove gossiper entry on address change When gossiper indexed entries by ip an old entry had to be removed on an address change, but the index is now id based, so even if the ip has changed the entry should stay. Gossiper simply updates the ip address there.	2025-05-04 17:59:07 +03:00
Gleb Natapov	a2178b7c31	storage_service: use id to check for local node IP may change and an old gossiper message with previous IP may be processed when it shouldn't. Fixes: #22777	2025-05-04 17:59:07 +03:00
Botond Dénes	51025de755	test/cluster: extract execute_with_tracing() into pylib/util.py To allow reuse in other tests.	2025-05-02 01:53:35 -04:00
Piotr Dulikowski	8ffe4b0308	utils::loading_cache: gracefully skip timer if gate closed The loading_cache has a periodic timer which acquires the _timer_reads_gate. The stop() method first closes the gate and then cancels the timer - this order is necessary because the timer is re-armed under the gate. However, the timer callback does not check whether the gate was closed but tries to acquire it, which might result in unhandled exception which is logged with ERROR severity. Fix the timer callback by acquiring access to the gate at the beginning and gracefully returning if the gate is closed. Even though the gate used to be entered in the middle of the callback, it does not make sense to execute the timer's logic at all if the cache is being stopped. Fixes: scylladb/scylladb#23951 Closes scylladb/scylladb#23952	2025-04-30 16:43:22 +03:00
Benny Halevy	4bd0845fce	gossiper: make send_gossip_echo cancellable Currently send_gossip_echo has a 22 seconds timeout during which _abort_source is ignored. Mark the verb as cancellable so it can be canceled on shutdown / abort. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:46:10 +03:00
Benny Halevy	fa1c3e86a9	gossiper: add send_echo helper Call send_gossip_echo using a centralized helper. A following patch will make it abortable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:45:51 +03:00
Benny Halevy	0b97806771	idl, message: make with_timeout and cancellable verb attributes composable And define `send_message_timeout_cancellable` in rpc_protocol_impl.hh using the newly introduced rpc_handler entry point in seastar that accepts both timeout and cancellable params. Note that the interface to the user still uses abort_source while internally the function allocates a seastar::rpc::cancellable object. It is possible to provide an interface that will accept a rpc::cancellable& from the caller, but the existing messaging api uses abort_source. Changing it may be considered in the future. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:45:51 +03:00
Benny Halevy	e06d226d08	gossiper: failure_detector_loop_for_node: ignore abort_requested_exception Aborting the failure detector happens normally when the node shuts down. There's no need to log anything about it, as long as we abort the function cleanly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:05:24 +03:00
Benny Halevy	83c69642f7	gossiper: failure_detector_loop_for_node: check if abort_requested in loop condition The same as the loop condition in the direct_failure_detector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:05:24 +03:00
Aleksandra Martyniuk	1f4edd8683	test_tablet_tasks: use injection to revoke resize Currently, test_tablet_resize_revoked tries to trigger split revoke by deleting some rows. This method isn't deterministic and so a test is flaky. Use error injection to trigger resize revoke. Fixes: #22570. Closes scylladb/scylladb#23966	2025-04-30 07:04:57 +03:00
Michał Chojnowski	9e2343ecb0	test_sstable_compression_dictionaries_autotrain: raise the timeout There were CI runs in which the training happened as planned, but it was too slow to fit within the timeout. Raise the timeout to pacify the CI. Fixes scylladb/scylladb#23964 Closes scylladb/scylladb#23965	2025-04-29 22:09:14 +03:00
Raphael S. Carvalho	d5bee4c814	test: Verify partitioned set store split and unsplit correctly Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	c77f710a0c	sstables: Fix quadratic space complexity in partitioned_sstable_set Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span a wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	59dad2121f	compaction: Introduce token_range() to table_state This provides a way for compaction layer to know compaction group's token range. It will be important for sstable set impl to know the token range of underlying group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	494ed6b887	dht: Add overlap_ratio() for token range Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Patryk Jędrzejczak	0cdcf82cd0	Merge 'topology coordinator: do not proceed further on invalid boostrap tokens' from Piotr Dulikowski In case when dht::boot_strapper::get_boostrap_tokens fails to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data which implicitly expects that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897 From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them. Closes scylladb/scylladb#23914 * https://github.com/scylladb/scylladb: test: cluster: add test_bad_initial_token topology coordinator: do not proceed further on invalid boostrap tokens cdc: add sanity check for generating an empty generation	2025-04-28 12:45:33 +02:00
Michał Chojnowski	7f9152babc	utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity() `chunked_managed_vector` is a vector-like container which splits its contents into multiple contiguous allocations if necessary, in order to fit within LSA's max preferred contiguous allocation limits. Each limited-size chunk is stored in a `managed_vector`. `managed_vector` is unaware of LSA's size limits. It's up to the user of `managed_vector` to pick a size which is small enough. This happens in `chunked_managed_vector::max_chunk_capacity()`. But the calculation is wrong, because it doesn't account for the fact that `managed_vector` has to place some metadata (the backreference pointer) inside the allocation. In effect, the chunks allocated by `chunked_managed_vector` are just a tiny bit larger than the limit, and the limit is violated. Fix this by accounting for the metadata. Also, before the patch `chunked_managed_vector::max_contiguous_allocation`, repeats the definition of logalloc::max_managed_object_size. This is begging for a bug if `logalloc::max_managed_object_size` changes one day. Adjust it so that `chunked_managed_vector` looks directly at `logalloc::max_managed_object_size`, as it means to.	2025-04-28 12:30:13 +02:00
Botond Dénes	d582c436e5	Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children. Fixes: https://github.com/scylladb/scylladb/issues/22514. Needs backport to 2025.1 and 6.2 as they contain the bug. Closes scylladb/scylladb#23787 * github.com:scylladb/scylladb: test: add test for getting tasks children tasks: check whether a node is alive before rpc	2025-04-28 09:32:45 +03:00
Nadav Har'El	262530f27c	Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as well as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). To achieve this, in this series we remove all base_info members that can change due to a base schema update, and we calculate the remaining values during view update generation, using the most up-to-date base schema version. To calculate the values that depend on the base schema version, we need to iterate over the view primary key and find the corresponding columns, which adds extra overhead for each batch of view updates. However, this overhead should be relatively small, as when creating a view update, we need to prepare each of its columns anyway. And if we need to read the old value of the base row, the relative overhead is even lower. After this change, the base info in view schemas stays the same for all base schema updates, so we'll no longer get issues with base_info being incompatible with a base schema version. Additionally, it's a step towards making the schema objects immutable, which we sometimes incorrectly assumed in the past (they're still not completely immutable yet, as some other fields in view_info other than base_info are initialized lazily and may depend on the base schema version). 
Fixes https://github.com/scylladb/scylladb/issues/9059 Fixes https://github.com/scylladb/scylladb/issues/21292 Fixes https://github.com/scylladb/scylladb/issues/22194 Fixes https://github.com/scylladb/scylladb/issues/22410 Closes scylladb/scylladb#23337 * github.com:scylladb/scylladb: test: remove flakiness from test_schema_is_recovered_after_dying mv: add a test for dropping an index while it's building base_info: remove the lw_shared_ptr variant view_info: don't re-set base_info after construction base_info: remove base_info snapshot semantics base_info: remove base schema from the base_info schema_registry: store base info instead of base schema for view entries base_info: make members non-const view_info: move the base info to a separate header view_info: move computation of view pk columns not in base pk to view_updates view_info: move base-dependent variables into base_info view_info: set base info on construction	2025-04-27 19:12:12 +03:00
David Garcia	cf7d846b9e	docs: update dependencies This is a mandatory dependency update to resolve a critical Dependabot alert. For more details, see the [Dependabot alerts](https://docs.github.com/en/code-security/dependabot/dependabot-alerts/viewing-and-updating-dependabot-alerts). Closes scylladb/scylladb#23918 Fixes #23935	2025-04-27 18:45:11 +03:00
Piotr Szymaniak	e588c8667f	alternator: Limit attribute name lengths Attribute names are now checked against DynamoDB-compatible length limits. When exceeded, Alternator emits an exception identical or similar to the DDB one. It might be worth noting that DDB emits more than a single kind of an exception string for some exceptions. The tests' catch clauses handle all the observed kinds of messages from DynamoDB. The validation differentiates between key and non-key attributes and applies the limit accordingly. AWS DDB raises exceptions with somewhat different contents when the get request contains ProjectionExpression, so this case needed separate treatment to emit the corresponding exception string. The length-validating function was declared and defined in expressions.hh/.cc respectively, because that's where the relevant parsing happens. ** Tests The following tests were validated when handling this issue: test_limit_attribute_length_nonkey_good, test_limit_attribute_length_nonkey_bad, test_limit_attribute_length_key_good, test_limit_attribute_length_key_bad, test_limit_attribute_length_gsi_lsi_good, test_limit_attribute_length_gsi_lsi_bad, test_limit_attribute_length_gsi_lsi_projection_bad. Some of the tests were expanded into being more granular. Namely, there is a new test function `test_limit_attribute_length_key_bad_incoherent_names` which groups tests with too long attribute names in the case of incorrect (incoherent) user requests. Similarly, there is a new test function `test_limit_attribute_length_gsi_lsi_bad_incoherent_names`. All the tests now cover each combination of the key/keys being too long. Both the new functions contain tests that verify that ScyllaDB throws length-related exceptions (instead of the coherency-related), similar to what DynamoDB does. The new test test_limit_gsiu_key_len_bad covers the case of too long attribute name inside GlobalSecondaryIndexUpdates. 
The new test test_limit_gsiu_key_len_bad_incoherent_names covers the case of incorrect (incoherent) user requests containing too long attribute names and GlobalSecondaryIndexUpdates. test_limit_attribute_length_key_bad was found to have contained an illegal KeySchema structure. Some of the tests had their match clauses corrected. All the tests are stripped of the xfail flag except test_limit_attribute_length_key_bad, which has it changed since it still fails due to Projection in GSI and LSI not implemented in Alternator. The xfail now points to #5036. Fixes scylladb/scylladb#9169 Closes scylladb/scylladb#23097	2025-04-27 18:39:20 +03:00
Piotr Dulikowski	82e1678fbe	test: mv: skip test_mv_tablets_empty_ip in debug mode This test shuts down a node and then replaces it with another one while continuously writing to the cluster. The test has been observed to take a lot of time in debug mode and time out on the replace operation. Replace takes very long because rebuilding tablets on the new node is very slow, and the slowest part is memtable flush which happens at the beginning of streaming. The slowness seems to be specific to the debug mode. Turn off the test in debug mode to deflake the CI. As a follow-up, the test is planned to be reworked into a quicker error injection test so that the code path tested by this test will be again exercised in debug unit tests (scylladb/scylladb#23898) Fixes: scylladb/scylladb#20316 Closes scylladb/scylladb#23900	2025-04-27 18:06:08 +03:00
Piotr Dulikowski	670a69007e	test: cluster: add test_bad_initial_token Adds a test which checks that rollback works properly in case when a bad value of the initial_token function is provided.	2025-04-25 12:25:15 +02:00
Piotr Dulikowski	845cedea7f	topology coordinator: do not proceed further on invalid boostrap tokens In case when dht::boot_strapper::get_boostrap_tokens fails to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data which implicitly expects that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897	2025-04-25 11:30:01 +02:00
Piotr Dulikowski	66acaa1bf8	cdc: add sanity check for generating an empty generation It doesn't make sense to create an empty CDC generation because it does not make sense to have a cluster with no tokens. Add a sanity check to cdc::make_new_generation_description which fails if somebody attempts to do that (i.e. when the set of current tokens + optionally bootstrapping node's tokens is empty). The function does not work correctly if it is misused, as we saw in scylladb/scylladb#23897. While the function should not be misused in the first place, it's better to throw an exception rather than crash - especially that this crash could happen on the topology coordinator.	2025-04-25 11:25:07 +02:00
Aleksandra Martyniuk	76cd707b18	test: test_tablets: wait for cql Wait for cql after rolling restart in test_two_tablets_concurrent_repair_and_migration_repair_writer_level to prevent failing queries. Fixes: #23620. Closes scylladb/scylladb#23796	2025-04-24 21:25:29 +03:00
Patryk Jędrzejczak	2a8bb47cfb	test: test_zero_token_nodes_topology_ops: use host IDs for ignored nodes Providing IP of an ignored node during removenode made the test flaky. It could happen that the address map contained mappings of two nodes with the same IP: 1. the node being ignored, 2. the node that expectedly failed replacing earlier in the test. So, `address_map::find_by_addr()` called in `find_raft_nodes_from_hoeps` could return the host ID of the second node instead of the first node and cause removenode to fail. We fix flakiness in this patch by providing the host ID of the ignored node instead of its IP. We would have to do it anyway sooner or later because providing IP is deprecated. The bug in `find_raft_nodes_from_hoeps` is tracked by scylladb/scylladb#23846. The test became flaky because of `f0af3f261e`. That patch is not present in 2025.1, so the test isn't flaky outside master, and hence there is no reason to backport this patch. Fixes scylladb/scylladb#23499 Closes scylladb/scylladb#23863	2025-04-24 20:17:19 +03:00
Pavel Emelyanov	68a178eba9	Merge 'replica: skip flush of dropped table' from Aleksandra Martyniuk Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead. Fixes: #16095. Needs backport to 2025.1 and 6.2 as they contain the bug Closes scylladb/scylladb#23876 * github.com:scylladb/scylladb: test: test table drop during flush replica: skip flush of dropped table	2025-04-24 20:02:59 +03:00
Andrei Chekun	22ef09489d	test.py: add awareness of extra_scylla_cmdline_options test_config.yaml can have field extra_scylla_cmdline_options that previously was not added to the commandline to start Scylla. Now any extra options will be added to commandline to start tests	2025-04-24 14:05:50 +02:00
Andrei Chekun	2758c4a08e	test.py: increase timeout for C++ tests in pytest The current timeouts are not enough: tests failed randomly by hitting the timeout. This will allow tests to finish normally. As a downside, if the process hangs, we will wait longer. These adjustments will be revisited once we have metrics on how long tests take to pass in each mode.	2025-04-24 14:05:50 +02:00
Andrei Chekun	f5c88e1107	test.py: switch method of finding the root repo directory Switch to using the constant defined in the __init__ file instead of getting the root directory from pytest's config. This will allow having only one source of truth for the root directory of the project, to avoid cases when the root directory is defined incorrectly. This change also simplifies potential changes in the future.	2025-04-24 14:05:50 +02:00
Andrei Chekun	06eca04370	test.py: move get_combined_tests to the correct facade Since get_combined_tests method is used only for boost tests and not all C++ tests, moving it into the correct place	2025-04-24 14:05:49 +02:00
Andrei Chekun	8cc9c0a53a	test.py: add common directory for reports When test.py executes Python tests, it runs them by mode and by file, so it can place the report under the mode directory. With the new approach, pytest will execute the tests for all modes itself, and we can only have one report per pytest invocation. That's why we need a common directory for reports, not one under the mode directory. It can later be used for simplification, so that every report goes there.	2025-04-24 14:05:49 +02:00
Andrei Chekun	b791af1f16	test.py: add the possibility to provide additional env vars This will allow injecting any environment variable into the test, because previously it took only the environment variables from the process. Add injection of ASAN and UBSAN variables to the tests.	2025-04-24 14:05:49 +02:00
Andrei Chekun	3cb5838619	test.py: move setup cgroups to the generic method This change is needed for the later integration of pytest executing the C++ tests, to be able to gather resource metrics.	2025-04-24 14:05:49 +02:00
Andrei Chekun	ca615af407	test.py: refactor resource_gather.py Refactor resource_gather.py to not create the initial cgroup when the process is already in it. This avoids going deeper, creating the same cgroup again and again with each test.py execution when the terminal isn't closed. Add creation of its own event loop in case one doesn't exist. This is needed to work both with test.py, which creates a loop, and with pytest, which does not.	2025-04-24 14:05:49 +02:00
Wojciech Mitros	ee5883770a	test: remove flakiness from test_schema_is_recovered_after_dying Due to the changes in creating schemas with base info the test_schema_is_recovered_after_dying seems to be flaky when checking that the schema is actually lost after 'grace_period'. We don't actually guarantee that the schema will be lost at that exact moment so there's no reason to test this. To remove the flakiness, we remove the check and the related sleep, which should also slightly improve the speed of this test.	2025-04-24 01:09:35 +02:00
Wojciech Mitros	bf7bba9634	mv: add a test for dropping an index while it's building Dropping an index is a schema change of its base table and a schema drop of the index's materialized view. This combination of schema changes used to cause issues during view building, because when a view schema was dropped, it wasn't getting updated with the new version of the base schema, and while the view building was in progress, we would update the base schema for the base table mutation reader and try generating updates with a view schema that wasn't compatible with the base schema, failing on an `on_internal_error`. In this patch we add a test for this scenario. We create an index, halt its view building process using an injection, and drop it. If no errors are thrown, the test succeeds. The test was failing before https://github.com/scylladb/scylladb/pull/23337 and is passing afterwards.	2025-04-24 01:09:32 +02:00
Wojciech Mitros	d77f11d436	base_info: remove the lw_shared_ptr variant The base_dependent_view_info is no longer needed to be shared or modified in the view_info, so we no longer need to keep it as a shared pointer.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	d7bd86591e	view_info: don't re-set base_info after construction In the previous commits we made sure that the base info is not dependent on the base schema version, and the info dependent on the base schema version is calculated when it's needed. In this patch we remove the unnecessary re-setting of the base_info. The set_base_info method isn't removed completely, because it also has a secondary function - zeroing the view_info fields other than base_info. Because of this, in this patch we rename it accordingly and limit its use to the updates caused by a base schema change.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	ea462efa3d	base_info: remove base_info snapshot semantics The base info in view schemas no longer changes on base schema updates, so saving the base info with a view schema from a specific point in time doesn't provide any additional benefits. In this patch we remove the code using the base_and_view snapshots as it's no longer useful.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	ad55935411	base_info: remove base schema from the base_info The base info now only contains values which are not reliant on the base schema version. We remove the base schema from the base info to make it immutable regardless of base schema version. At the point of this patch it's also not needed anywhere - the new base info can replace the base schema in most places, and in the few (view_updates) where we need it, we pull the most recent base schema version from the database. After this change, the base info no longer changes in a view schema after creation, so we'll no longer get errors when we try generating view updates with a base_info that's incompatible with a specific base schema version. Fixes #9059 Fixes #21292 Fixes #22410	2025-04-24 01:08:39 +02:00
Wojciech Mitros	05fce91945	schema_registry: store base info instead of base schema for view entries In the following patch we plan to remove the base schema from the base_info to make the base_info immutable. To do that, we first prepare the schema registry for the change; we need to be able to create view schemas from frozen schemas there and frozen schemas have no information about the base table. Unless we do this change, after base schemas are removed from the base info, we'll no longer be able to load a view schema to the schema registry without looking up the base schema in the database. This change also required some updates to schema building: * we add a method for unfreezing a view schema with base info instead of a base schema * we make it possible to use schema_builder with a base info instead of a base schema * we add a method for creating a view schema from mutations with a base info instead of a base schema * we add a view_info constructor with a base info instead of a base schema * we update the naming in schema_registry to reflect the usage of base info instead of base schema	2025-04-24 01:08:39 +02:00
Wojciech Mitros	6e539c2b4d	base_info: make members non-const In the following patches we'll add the base info instead of the base schema to various places (schema building, schema registry). There, we'll sometimes need to update the base_info fields, which we can't do with const members. There's also a place (global_schema_ptr) where we won't be able to use the base_info_ptr (a shared pointer to the base_info), so we can't just use the base_info_ptr everywhere instead. In this patch we unmark these members as const. In the following patches we'll remove the methods for changing the base_info in the view schema, so it will remain effectively const.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	32258d8f9a	view_info: move the base info to a separate header In the following commits the base_dependent_view_info will be needed in many more places. To avoid including the whole db/view/view.hh or forward declaring (where possible) the base info, we move it to a separate header which can be included anywhere at almost no cost.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	a3d2cd6b5e	view_info: move computation of view pk columns not in base pk to view_updates In preparation of making the base_info immutable, we want to get rid of any base_dependent_view_info fields that can change when base schema is updated. The _base_regular_columns_in_view_pk and _base_static_columns_in_view_pk hold the base column_ids of corresponding base columns and they can change (decrease) when an earlier column is dropped in the base table. view_updates is the only location where these values are used and calculating them is not expensive when compared to the overall work done while performing a view update - we iterate over all view primary key columns and look them up in the base table. With this in mind, we can just calculate them when creating a view_updates object, instead of keeping them in the base_info. We do that in this patch.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	a33963daef	view_info: move base-dependent variables into base_info The has_computed_column_depending_on_base_non_primary_key and is_partition_key_permutation_of_base_partition_key variables in the view_info depend on the base table so they should be in the base_dependent_view_info instead of view_info.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	900687c818	view_info: set base info on construction Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as well as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). The first step towards that is making sure that all newly created schemas have the base info set. We achieve that by requiring a base schema when constructing a view schema. Unfortunately, this adds complexity each time we're making a view schema - we need to get the base schema as well. In most cases, the base schema is already available. The most problematic scenario is when we create a schema from mutations: - when parsing system tables we can get the schema from the database, as regular tables are parsed before views - when loading a view schema using the schema loader tool, we need to load the base additionally to the view schema, effectively doubling the work - when pulling the schema from another node - in this case we can only get the current version of the base schema from the local database Additionally, we need to consider the base schema version - when we generate view updates the version of the base schema used for reads should match the version of the base schema in view's base info. This is achieved by selecting the correct (old or new) schema in `db::schema_tables::merge_tables_and_views` and using the stored base schema in the schema_registry.	2025-04-24 01:08:39 +02:00
Benny Halevy	f279625f59	test_tablets_cql: test_alter_dropped_tablets_keyspace: extend expected error The query may fail also on a no_such_keyspace exception, which generates the following cql error: ``` Error from server: code=2200 [Invalid query] message="Can\'t find a keyspace test_1745198244144_qoohq" ``` Extend the pytest.raises match expression to include this error as well. Fixes #23812 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23875	2025-04-23 18:54:22 +03:00
Benny Halevy	2bbdaeba1c	Update seastar submodule * seastar e44af9b0...d7ff58f2 (2): > rpc: client: support timeout and cancellation > doc/io-properties-file.md: correct a typo Closes scylladb/scylladb#23865	2025-04-23 16:10:51 +03:00
Aleksandra Martyniuk	c1618c7de5	test: test table drop during flush	2025-04-23 14:29:28 +02:00
Aleksandra Martyniuk	91b57e79f3	replica: skip flush of dropped table	2025-04-23 14:29:28 +02:00
Kefu Chai	0d7752b010	build: cmake: generalize update_cxx_flags() Refactor our CMake flag handling to make it more flexible and reduce repetition: - Rename update_cxx_flags() to update_build_flags() to better reflect its expanded purpose - Generate CMake variable names internally based on configuration type instead of requiring callers to specify full variable names - Follow CMake's standard naming conventions for configuration-specific flags, see https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_FLAGS.html#variable:CMAKE_%3CLANG%3E_FLAGS - Prepare groundwork for handling linker flags in addition to compiler flags in future changes Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23842	2025-04-23 12:06:04 +03:00
Nadav Har'El	64a5eee6b9	test/cqlpy: insert test names into Scylla logs Both test.py and test/cqlpy/run run many test functions against the same Scylla process. In the resulting log file, it is hard to understand which log messages are related to which test. In this patch, we log a message (using the "/system/log" REST API) every time a test is started or ends. The messages look like this: INFO 2025-04-22 15:10:44,625 [shard 1:strm] api - /system/log: test/cqlpy: Starting test_lwt.py::test_lwt_missing_row_with_static ... INFO 2025-04-22 15:10:44,631 [shard 0:strm] api - /system/log: test/cqlpy: Ended test_lwt.py::test_lwt_missing_row_with_static We already had a similar feature in test/alternator, added three years ago in commit `b0371b6bf8`. The implementation is similar but not identical due to different available utility functions, and in any case it's very simple. While at it, this patch also fixes the has_rest_api() to timeout after one second. Without this, if the REST API is blocked in a way that a connection attempt just hangs, the tests can hang. With the new timeout, the test will hang for a second, realize the REST API is not available, and remember this decision (the next tests will not wait one second again). We had the same bug in Alternator, and fixed it in `758f8f01d7`. This one second "pause" will only happen if the REST API port is blocked - in the more typical case the REST API port is just not listening but not blocked, and the failure will be noticed immediately and won't wait a whole second. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23857	2025-04-23 12:04:14 +03:00
Piotr Dulikowski	3d73c79a72	test: mv: skip test_view_building_scheduling_group in debug The test populates a table with 50k rows, creates a view on that table and then compares the time spent in streaming vs. gossip scheduling groups. It only takes 10s in dev mode on my machine, but is much slower in debug mode in CI - building the view doesn't finish within 2 minutes. The bigger the view to build, the more accurate the measurement; moreover, the test scenario isn't interesting enough to be worth running it in debug mode as this should be covered by other tests. Therefore, just skip this test in debug mode. Fixes: scylladb/scylladb#23862 Closes scylladb/scylladb#23866	2025-04-23 11:29:35 +03:00
Pavel Emelyanov	a6ba535c3c	Merge 'test.py: refactoring before boost pytest integration' from Andrei Chekun This PR contains changes that do not add new functionality, and have small refactoring of the existing code. The most significant change though is switching the SQLite writer from a singleton to a thread locking mechanism that will be needed later on. This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer [request](https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278). Closes scylladb/scylladb#23867 * github.com:scylladb/scylladb: test.py: move the readme file for LDAP tests to the correct location test.py: eliminate deprecation warning for xml.etree.ElementTree.Element test.py: align the behavior of max-failures parameter with pytest maxfail test.py: fix typo in toxiproxy name parameter test.py: add locking to the sqlite writer for resource gather test.py: add sqlite datetime adapter for resource gather test.py: change the parameter for get_modes_to_run()	2025-04-23 11:10:56 +03:00
Andrzej Jackowski	3c69340b8c	test: add test_long_query_timeout_erm This commit adds a test to verify that a query with long timeout doesn't block ERM on failure. The motivation for the test is fixing scylladb#21831. This commit: - add test_long_query_timeout_erm	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	1f1e4f09cd	test: add get_cql_exclusive to manager_client.py This commit adds to ManagerClient a get_cql_exclusive function that allows creating a cql connection with WhiteListRoundRobinPolicy for a single server. Such a connection is useful in tests that kill nodes to make sure that the live node handles the queries. Before this commit, some tests used cluster_con from test/cluster/conftest.py, and after this commit tests can start to use a method from ManagerClient. This change: - Extend ManagerClient con_gen type to allow LoadBalancingPolicy arg - Implement get_cql_exclusive()	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	9d53063a7e	mapreduce: catch local read_failure_exception_with_timeout Mapreduce Service exception handling differs for local and remote RPC calls of dispatch_to_shards. Whereas local exceptions are handled normally, the remote exceptions are converted to rpc::remote_verb_error by the framework. This is a substantial difference when read_failure_exception_with_timeout is thrown during mapreduce query execution - CQL server waits for the exception from the local call but not from the remote one. As we don't want to wait for the timeout in CQL server in either of the cases, this commit catches the local exception (especially read_failure_exception_with_timeout) and converts it to std::runtime_error (the one from which rpc::remote_verb_error inherits). Ideally, Mapreduce Service should execute dispatch_to_shards through RPC for both local and remote calls. However, such change negatively affects tens of Unit Tests that rely on the possibility to run local mapreduce service without any RPC. This change: - Catch local exceptions in Mapreduce Service and convert them to std::runtime_error.	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	1fca994c7b	transport: storage_proxy: release ERM when waiting for query timeout Before this change, if a read executor had just enough targets to achieve query's CL, and there was a connection drop (e.g. node failure), the read executor waited for the entire request timeout to give drivers time to execute a speculative read in the meantime. Such behavior doesn't work well when a very long query timeout (e.g. 1800s) is set, because the unfinished request blocks topology changes. This change implements a mechanism to throw a new read_failure_exception_with_timeout in the aforementioned scenario. The exception is caught by CQL server which conducts the waiting, after ERM is released. The new exception inherits from read_failure_exception, because layers that don't catch the exception (such as mapreduce service) should handle the exception just as a regular read_failure. However, when CQL server catches the exception, it returns read_timeout_exception to the client because after additional waiting such an error message is more appropriate (read_timeout_exception was also returned before this change was introduced). This change: - Add new read_failure_exception_with_timeout exception - Add throw of read_failure_exception_with_timeout in storage_proxy - Add abort_source to CQL server, as well as to_stop() method for the correct abort handling - Add sleep in CQL server when the new exception is caught Refs #21831	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	9b1f062827	transport: remove redundant references in process_request_one The references were added and used in previous commits to limit the number of line changes for the reviewer's convenience. This commit removes the redundant references to make the code clearer and more concise.	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	9c0f369cf8	transport: fix the indentation in process_request_one Fix the indentation after the previous commit that intentionally had a wrong indent to limit the number of changed lines	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	8a7454cf3e	transport: add futures in CQL server exception handling Prepare for the next commit that will introduce a seastar::sleep in the handling of a selected exception. This commit: - Rewrite cql_server::connection::process_request_one to use seastar::futurize_invoke and try_catch<> instead of utils::result_try. - The indentation is intentionally incorrect to reduce the number of changed lines. Next commits fix it.	2025-04-23 09:29:05 +02:00
Andrei Chekun	57b66e6b2e	test.py: move the readme file for LDAP tests to the correct location The README file was created in an incorrect location; it is now moved to the directory with the source files, where it was intended to be.	2025-04-22 19:03:28 +02:00
Andrei Chekun	cf4747c151	test.py: eliminate deprecation warning for xml.etree.ElementTree.Element Testing the truth value of an Element emits a DeprecationWarning. The check is now done correctly.	2025-04-22 19:03:21 +02:00
Andrei Chekun	bc49cd5214	test.py: align the behavior of max-failures parameter with pytest maxfail This will allow transferring the existing max-failures values to pytest without any modification. As a downside, the test.py logic for handling this parameter changes slightly.	2025-04-22 19:03:08 +02:00
Andrei Chekun	5c3501e4bf	test.py: fix typo in toxiproxy name parameter Fix typo in toxiproxy name parameter. No functional changes, just a cosmetic fix.	2025-04-22 19:02:12 +02:00
Andrei Chekun	2c37a793d1	test.py: add locking to the sqlite writer for resource gather SQLite blocks the DB during writes, so it's not possible to write from several threads at once. To be able to gather metrics in several threads, we need a locking mechanism for threads during writes, so that a thread will not try to write metrics while another thread is performing writes.	2025-04-22 19:01:30 +02:00
Andrei Chekun	800710dc2c	test.py: add sqlite datetime adapter for resource gather Add sqlite datetime adapter for resource gather, since the default adapters are deprecated as of Python 3.12	2025-04-22 18:59:49 +02:00
Andrei Chekun	bf2a9e267e	test.py: change the parameter for get_modes_to_run() Change the parameter for get_modes_to_run() from session to config to narrow the scope, and prepare it for later use in methods that do not have access to the session, but have access to the config object	2025-04-22 18:58:33 +02:00
Kefu Chai	7254c0c515	db/config.cc: correct a typo in option's description s/incomming/incoming/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23826	2025-04-22 16:55:04 +03:00
Pavel Emelyanov	65efd2b2f6	Merge 'Refactor and enhance s3_tests' from Ernest Zaslavsky This PR introduces a cleanup mechanism in s3_tests to remove uploaded objects after the test completes, ensuring a clean testing environment. Additionally, the recently added test has been refactored and split into smaller, more maintainable parts, improving readability and extending its coverage to include the "proxied" case. As these changes primarily improve code aesthetics and maintainability, backporting is not necessary. Refs: https://github.com/scylladb/scylladb/issues/23830 Closes scylladb/scylladb#23828 * github.com:scylladb/scylladb: s3_tests: Improve and extend copy object test coverage s3_tests: Implement post-test cleanup for uploaded objects	2025-04-22 16:40:37 +03:00
Nadav Har'El	5fd2eabd48	Merge 'Generalize the diversity of parse_table_infos() callers in API' from Pavel Emelyanov The helper in question is used in several different ways -- by handlers directly (most of the callers), as a part of wrap_ks_cf() helper and by one of its overloads that unpack the "cf" query parameter from request. This PR generalizes most of the described callers, thus reducing the number of differently-looking ways API handlers parse "keyspace" and "cf" request parameters. Continuation of #22742 Closes scylladb/scylladb#23368 * github.com:scylladb/scylladb: api: Squash two parse_table_infos into one api: Generalize keyspaces:tables parsing a little bit more api: Provide general pair<keyspace, vector<table>> parsing api: Remove ks_cf_func and related code	2025-04-22 15:40:06 +03:00
Nadav Har'El	8d1a413357	test/scylla_gdb: better error message when running on dev build mode The test/scylla_gdb suite needs Scylla to have been built with debug symbols - which is NOT the case for the dev build. So the script test/scylla_gdb/run attempts to recognize when a developer runs it on an executable with the debug symbols missing - and prints a clear error. Unfortunately, as we noticed in #10863, and again in #23832, because wasmtime is compiled with debug symbols and linked with Scylla, build/dev/scylla "pretends" to have debug symbols, foiling the check in test/scylla_gdb/run. Reviewers rejected two solutions to this problem (pull requests #10865 and #10923), so in pull request #10937 I added a cosmetic solution just for test/scylla_gdb: in test/scylla_gdb/conftest.py we check that there are really debug symbols that interest us, and if not, exit immediately instead of failing each test separately. For some reason, the sys.exit() we used is no longer effective - it no longer exits pytest, so in this patch we use pytest.exit() instead. Fixes #23832 (sort of, we leave build/dev/scylla with the fake claim that it has debug symbols, but test/scylla_gdb will handle this situation more gracefully). Closes scylladb/scylladb#23834	2025-04-22 15:02:06 +03:00
Michael Litvak	5c1d24f983	test: test_mv_topology_change: increase timeout for remove_node The test `test_mv_write_to_dead_node` currently uses a timeout of 60 seconds for remove_node, after it was increased from 30 seconds to fix scylladb/scylladb#22953. Apparently it is still too low, and it was observed to fail in debug mode. Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000 seconds, but the test requires a timeout which is shorter than 5 minutes, because it is a regression test for an issue where MV updates hold topology changes for more than 5 minutes, and we want to verify in the test that the topology change completes in less than 5 minutes. To resolve the issue, we set the test to skip in debug mode, because the remove node operation is unpredictably slow, and we increase the timeout to 180 seconds which is hopefully enough time for remove_node in non-debug modes, and still sufficient to satisfy the test requirements. Fixes scylladb/scylladb#22530 Closes scylladb/scylladb#23833	2025-04-22 10:51:19 +02:00
Kefu Chai	a2b46cbf45	sstables_loader: fix the indent Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-04-22 12:05:55 +08:00
Kefu Chai	6b3ecad467	sstables_loader: fix the racing between get_progress() and release_resources() This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them. Problem: - Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data) - `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard` - As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard` - If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory. Solution: - Implemented shared/exclusive locking to protect access to both state and sharded progress data - Multiple `get_progress()` calls can execute in parallel (shared access) - `release_resources()` acquires exclusive access before modifying resources - This prevents potential memory corruption and ensures consistent progress reporting Fixes #23801 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-04-22 12:05:54 +08:00
Ernest Zaslavsky	edaa3f4bdd	s3_tests: Improve and extend copy object test coverage Refactored the copy object test to enhance readability and maintainability. The test was simplified and split into smaller, more focused parts. Additionally, a "proxied" variant of the test was introduced to expand coverage.	2025-04-21 20:54:14 +03:00
Ernest Zaslavsky	252a0a14af	s3_tests: Implement post-test cleanup for uploaded objects Ensure cleanup after tests by deleting objects uploaded to MinIO. This improves resource management and maintains a clean test environment.	2025-04-21 20:54:14 +03:00
Avi Kivity	2dcd2b21ae	Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631 Backport to 2025.1 because load imbalance is a serious problem in production. Closes scylladb/scylladb#23708 * github.com:scylladb/scylladb: tablets: Equalize per-table balance when allocating tablets for a new table load_sketch: Tolerate missing tablet_map when selecting for a given table tests: tablets: Simplify tests by moving common code to topology_builder	2025-04-21 17:06:30 +03:00
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Yaniv Michael Kaul	b374f94b15	pip installation: use --no-cache-dir There are three reasons we may want NOT to use caching of pip deps: 1. When building a container, unless we specifically clean it up, it'll remain, even when we squash the image layers later. 2. When building a container, that cache is not useful, as we squash our containers later (so that layer is not cached really). And our CI cleans up the layers repo anyway. 3. Caching sometimes isn't great, and doesn't ensure we pick up the exact version (or latest) that we wish to... This PR changes two locations in Scylla, both of which (also) build containers, so certainly relevant for 1, 2 above and possibly 3. No real need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#23822	2025-04-21 13:46:57 +03:00
Avi Kivity	0ba3ce1741	test: gdb: avoid using `file(1)` to determine if debug information is present The scylla_gdb tests verify, as a sanity check, that the executable was built with debug information. They do so via file(1). In Fedora 42, file(1) crashes on ELF files that have interpreter pathnames larger than 128 characters[1]. This was later fixed[2], but the fix is not in any release. Work around the problem by using objdump instead of file. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2354970 [2] `b3384a1fbf` Closes scylladb/scylladb#23823	2025-04-21 13:29:27 +03:00
Andrei Chekun	441cee8d9c	test.py: fix gathering logs in case of fail Currently log files have information about run_id twice: cluster.object_store_test_backup.10.test_abort_restore_with_rpc_error.dev.10_cluster.log However, sometimes the first run_id can be incorrect: cluster.object_store_test_backup.1.test_abort_restore_with_rpc_error.dev.10_cluster.log Remove the first run_id from the name to avoid this issue, and because it's actually redundant. Remove the creation of an empty file for the scylla manager log, since it is redundant and was based on an incorrect assumption about the root cause of the failure. Add an extension to the stacktrace file, so that in Jenkins it is opened in a new browser tab instead of being downloaded. Fixes: https://github.com/scylladb/scylladb/issues/23731 Closes scylladb/scylladb#23797	2025-04-21 13:12:35 +03:00
Pavel Emelyanov	09caad6147	test: Remove sstable_assertions::get_stats_metadata() It mirrors the sstable method of the same name, which is public. With -> operator, it's just as convenient to call it directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-04-18 18:53:41 +03:00
Pavel Emelyanov	294e56207d	test: Add sstable_assertions::operator->() ... and replace get_sstable() with it. It's more natural (despite having the only user) to consider the class to be yet another "pointer" to an sstable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-04-18 18:52:39 +03:00
Sergey Zolotukhin	2314feeae2	test: Ignore DEBUG,TRACE,INFO level messages when checking for failed mutations. Update the regular expression in `check_node_log_for_failed_mutations` to avoid false test failures when DEBUG-level logging is enabled. Fixes scylladb/scylladb#23688 Closes scylladb/scylladb#23658	2025-04-18 16:17:41 +03:00
Calle Wilund	4a44651fce	encryption_at_rest_test: Make fake_proxy read/write loop noexcept Fixes #23774 Test code falls into same when_all issue as http client did. Avoid passing exceptions through this, and instead catch and report in worker lambda. Closes scylladb/scylladb#23778	2025-04-18 16:17:41 +03:00
Pavel Emelyanov	324daac156	Merge 'Add CopyObject API implementation to S3 client' from Ernest Zaslavsky Implement the CopyObject API to directly copy an S3 object from one location to another. This implementation incurs no networking overhead on the client side, since the object is copied internally by the S3 machinery. Usage example: Backup of tiered SSTables - you already have SSTables on S3, so CopyObject is the ideal way to go. No need to backport since we are adding new functionality for future use Closes scylladb/scylladb#23779 * github.com:scylladb/scylladb: s3_client: implement S3 copy object s3_client: improve exception message s3_client: reposition local function for future use	2025-04-18 16:17:41 +03:00
Pavel Emelyanov	cc919b08c2	Merge 'backup: Optimize S3 throughput with shard-based upload' from Ernest Zaslavsky This PR enhances S3 throughput by leveraging every available shard to upload backup files concurrently. By distributing the load across multiple shards, we significantly improve the upload performance. Each shard retrieves an SSTable and processes its files sequentially, ensuring efficient, file-by-file uploads. To prevent uncontrolled fiber creation and potential resource exhaustion, the backup task employs a directory semaphore from the sstables_manager. This mechanism helps regulate concurrency at the directory level, ensuring stable and predictable performance during large-scale backup operations. Refs #22460 fixes: #22520 ``` =========================================== Release build, master, smp-16, mem-32GiB Bytes: 2342880184, backup time: 9.51 s =========================================== Release build, this PR, smp-16, mem-32GiB Bytes: 2342891015, backup time: 1.23 s =========================================== ``` Looks like it is faster at least x7.7 No backport needed since it (native backup) is still unused functionality Closes scylladb/scylladb#23727 * github.com:scylladb/scylladb: backup: Add test for invalid endpoint backup_task: upload on all shards backup_task: integrate sharded storage manager for upload	2025-04-18 16:17:41 +03:00
Avi Kivity	6b415cfd4b	Merge 'managed_bytes: in the copy constructor, respect the target preferred allocation size' from Michał Chojnowski Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. 
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781 This is a regression fix, should be backported to all affected releases. Closes scylladb/scylladb#23782 * github.com:scylladb/scylladb: managed_bytes_test: add a reproducer for #23781 managed_bytes: in the copy constructor, respect the target preferred allocation size	2025-04-17 21:14:10 +03:00
Pavel Emelyanov	ca2cc5e826	Merge 'test/cluster/test_read_repair: make incremental test work with tablets' from Botond Dénes There are two tests which test incremental read repair: one with row tombstones, the other with partition tombstones. The tests currently force vnodes, by creating the test keyspace with {'enabled': false}. Even so, the tests were found to be flaky, so one of them is marked for skip. This commit does the following changes: * Make the tests use tablets by creating the test keyspace with tablets. * Change the way the tests write data so it works with tablets: currently the tests use scylla-sstable write + upload but this won't work with tablets since upload with tablets implies --load-and-stream which means data is streamed to all replicas (no difference created between nodes). Switch to the classic stop-node + write to other replica with CL=ONE. * Remove the skip added to the partition-tombstone test variant. Fixes: #21179 Test improvement, no backport required. Closes scylladb/scylladb#23167 * github.com:scylladb/scylladb: wip test/cluster/test_read_repair: make incremental test work with tablets	2025-04-17 18:54:00 +03:00
Piotr Dulikowski	325a89638c	doc: changing topology when changing snitches is no longer supported Update the "How to Switch Snitches" document to indicate that changing topology (i.e. changing node's DC or rack) while changing the snitch is no longer supported. Remove a note which said that switching snitches is not supported with tablets. It was introduced because of the concern that switching a snitch might change DC or rack of the node, for which our current tablet load balancer is completely unprepared. Now that changing DC/rack is forbidden, there doesn't seem to be anything related to snitches which could cause trouble for tablets.	2025-04-17 16:22:58 +02:00
Piotr Dulikowski	796c8d1601	test: cluster: introduce test_no_dc_rack_change The test makes sure that changing the DC or rack in the snitch's configuration fails with an expected error.	2025-04-17 16:22:58 +02:00
Piotr Dulikowski	1791ae3581	storage_service: don't update DC/rack in update_topology_with_local_metadata The DC/rack are now immutable and cannot be changed after restart, so there is no need to update the node's system.topology entry with this information on restart.	2025-04-17 16:22:58 +02:00
Piotr Dulikowski	ce2fab7cce	main: make dc and rack immutable after bootstrap Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278	2025-04-17 16:22:26 +02:00
Tomasz Grabiec	1e407ab4d2	tablets: Equalize per-table balance when allocating tablets for a new table Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631	2025-04-17 16:01:23 +02:00
Tomasz Grabiec	2597a7e980	load_sketch: Tolerate missing tablet_map when selecting for a given table To simplify future usage in network_topology_strategy::add_tablets_in_dc() which invokes populate() for a given table, which may be either new or preexisting.	2025-04-17 16:01:16 +02:00
Ernest Zaslavsky	b79ca5a1aa	backup: Add test for invalid endpoint * During the development phase, the backup functionality broke because we lacked a test that runs backup with an invalid endpoint. This commit adds a test to cover that scenario. * Add checking for the expected error to be propagated from failing/aborted backup	2025-04-17 16:31:43 +03:00
Benny Halevy	b7212620f9	backup_task: upload on all shards Use all shards to upload snapshot files to S3, by using the sharded sstables_manager_for_table infrastructure. Refs #22460 Quick perf comparison =========================================== Release build, master, smp-16, mem-32GiB Bytes: 2342880184, backup time: 9.51 s =========================================== Release build, this PR, smp-16, mem-32GiB Bytes: 2342891015, backup time: 1.23 s =========================================== Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>	2025-04-17 16:31:42 +03:00
Piotr Dulikowski	dd2e507ece	test: cluster: remove test_snitch_change This test checked that it is possible to change DC/rack of a node during restart. This will become explicitly forbidden, so remove the test.	2025-04-17 13:51:22 +02:00
Aleksandra Martyniuk	e178bd7847	test: add test for getting tasks children Add test that checks whether the children of a virtual task will be properly gathered if a node is down.	2025-04-17 13:48:44 +02:00
Aleksandra Martyniuk	53e0f79947	tasks: check whether a node is alive before rpc Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children.	2025-04-17 12:51:22 +02:00
Michał Chojnowski	6c1889f65c	managed_bytes_test: add a reproducer for #23781	2025-04-17 12:51:01 +02:00
Botond Dénes	8ac7c54d8b	Merge 'topology_coordinator: stop: await all background_action_holder:s' from Benny Halevy Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 * The issue exists since 6.2 Closes scylladb/scylladb#17712 * github.com:scylladb/scylladb: topology_coordinator: stop: await all background_action_holder:s topology_coordinator: stop: improve error messages topology_coordinator: stop: define stop_background_action helper	2025-04-17 12:10:29 +03:00
Kefu Chai	b0cbe86780	s3/client: define a constant for security credential resource Instead of repeating it, let's define a constant and reuse it. Less repetition this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23713	2025-04-17 11:51:15 +03:00
Kefu Chai	a33651b03e	db, service: do not include unused header these unused headers were flagged by clang-include-cleaner. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23735	2025-04-17 11:49:59 +03:00
Botond Dénes	33e383c557	scripts/pull_github_pr.sh: add argument parsing Instead of hardcoding PR_NUM=$1 and FORCE=$2, parse the arguments properly. The current setup is not very flexible and one gets no feedback if the arguments are incorrect or not recognized. Add proper position-independent argument parsing using a classic while case loop. Closes scylladb/scylladb#23623	2025-04-17 11:49:15 +03:00
Nadav Har'El	84d4af1f0e	Merge 'Alternator batch rcu' from Amnon Heiman This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator. It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes. Need backporting to 2025.1, as RCU and WCU are not fully supported Fixes #23690 Closes scylladb/scylladb#23691 * github.com:scylladb/scylladb: test_returnconsumedcapacity.py: test RCU for batch get item alternator/executor: Add RCU support for batch get items alternator/consumed_capacity: make functionality public	2025-04-17 10:08:16 +03:00
Botond Dénes	22a28ca1db	wip	2025-04-17 03:01:17 -04:00
Ernest Zaslavsky	a369dda049	s3_client: implement S3 copy object Add support for the CopyObject API to enable direct copying of S3 objects between locations. This approach eliminates networking overhead on the client side, as the operation is handled internally by S3.	2025-04-17 09:47:47 +03:00
Botond Dénes	19b4f10598	test/cluster/test_read_repair: make incremental test work with tablets There are two tests which test incremental read repair: one with row tombstones, the other with partition tombstones. The tests currently force vnodes, by creating the test keyspace with {'enabled': false}. Even so, the tests were found to be flaky, so one of them is marked for skip. This commit does the following changes: * Make the tests use tablets by creating the test keyspace with tablets. * Change the way the tests write data so it works with tablets: currently the tests use scylla-sstable write + upload but this won't work with tablets since upload with tablets implies --load-and-stream which means data is streamed to all replicas (no difference created between nodes). Switch to the classic stop-node + write to other replica with CL=ONE. * Remove the skip added to the partition-tombstone test variant. Also add tracing to the read-repair query, to make debugging the test easier if it fails. Fixes: #21179	2025-04-17 02:01:17 -04:00
Michał Chojnowski	4e2f62143b	managed_bytes: in the copy constructor, respect the target preferred allocation size Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. 
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781	2025-04-16 22:06:06 +02:00
Nadav Har'El	6db666a1c1	replica: fix 10-second pause during shutdown As noticed in issue #23687, if we shut down Scylla while a paged read is in progress - or even a paged read that the client had no intention of ever resuming - the shutdown pauses for 10 seconds. The problem was the stop() order - we must stop the "querier cache" before we can close sstables - the "querier cache" is what holds paged readers alive waiting for clients to resume those reads, and while a reader is alive it holds on to sstables so they can't be closed. The querier cache's querier_cache::default_entry_ttl is set to 10 seconds, which is why the shutdown was un-paused after 10 seconds. The fix in this patch is obvious: We need to stop the querier cache (and have it release all the readers it was holding) before we close the sstables. Fixes #23687 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23770	2025-04-16 20:35:44 +03:00
Avi Kivity	0206da5232	Merge 'readers: strip "flat" and "v2" from names' from Botond Dénes Continue the effort of normalizing reader names, stripping legacy qualifying terms like "flat" and "v2". Flat and v2 readers are the default now, we only need to add qualifying terms to readers which are different than the normal. One such reader remains: `make_generating_reader_v1()`. This PR contains mostly mechanical changes, done with a sed script. Commits which only contain such mechanical renames are marked as such in the commitlog. Code cleanup, no backport needed. Closes scylladb/scylladb#23767 * github.com:scylladb/scylladb: readers: mv reversing_v2.hh reversing.hh readers: mv generating_v2.hh generating.hh tree: s/make_generating_reader_v2/make_generating_reader/ readers: mv from_mutations_v2.hh from_mutations.hh tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s readers: mv from_fragments_v2.hh from_fragments.hh readers: mv forwardable_v2.hh forwardable.hh readers: mv empty_v2.hh empty.hh tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ readers/empty_v2.hh: replace forward declarations with include of fwd header readers/mutation_reader_fwd.hh: forward declare reader_permit readers: mv delegating_v2.hh delegating.hh readers/delegating_v2.hh: move reader definition to _impl.hh file	2025-04-16 20:21:51 +03:00
Ernest Zaslavsky	8929cb324e	s3_client: improve exception message Clarify that the multipart upload was aborted due to a failure in parsing ETags.	2025-04-16 18:58:22 +03:00
Ernest Zaslavsky	993953016f	s3_client: reposition local function for future use The local function has been relocated higher in the code to prepare for its usage in upcoming implementations.	2025-04-16 18:46:31 +03:00
Ernest Zaslavsky	428f673ca2	backup_task: integrate sharded storage manager for upload Introduce the sharded storage manager and use it to instantiate upload clients. Full functionality will be implemented in subsequent changes.	2025-04-16 18:18:58 +03:00
Amnon Heiman	3acde5f904	test_returnconsumedcapacity.py: test RCU for batch get item This patch adds tests for consumed capacity in batch get item. It tests both the simple case and the multi-item, multi-table case that combines consistent and non-consistent reads.	2025-04-16 17:05:32 +03:00
Pavel Emelyanov	8b2cababb6	generic_server: Don't mess with db::config The db::config is top-level configuration of scylla, we generally try to avoid using it even in scylla components: each uses its own config initialized by the service creator out of the db::config itself. The generic_server is not an exception, all the more so, it already has its own config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23705	2025-04-16 17:02:30 +03:00
Amnon Heiman	88095919d0	alternator/executor: Add RCU support for batch get items This patch adds RCU support for batch get items. With batch requests, multiple objects are read from multiple tables. While the criterion for adding the units is per the batch request, the units are calculated per table—and so is the read consistency.	2025-04-16 16:53:22 +03:00
Amnon Heiman	0eabf8b388	alternator/consumed_capacity: make functionality public The consumed_capacity_counter is not completely applicable for batch operations. This patch makes some of its functionality public so that batch get item can use the components to decide if it needs to send consumed capacity in the reply, to get the half units used by the metrics and returned result, and to allow an empty constructor for the RCU counter.	2025-04-16 16:49:40 +03:00
Benny Halevy	7a0f5e0a54	topology_coordinator: stop: await all background_action_holder:s Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:23:02 +03:00
Benny Halevy	6de79d0dd3	topology_coordinator: stop: improve error messages "when cleanup" is ill-formed. Change "when XYZ" to "during XYZ" instead. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:58 +03:00
Benny Halevy	d624795fda	topology_coordinator: stop: define stop_background_action helper Refactor the code to use a helper to await background_action_holder and handle any errors by printing a warning. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:39 +03:00
Botond Dénes	6172ff501f	readers: mv reversing_v2.hh reversing.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	c8563b9604	readers: mv generating_v2.hh generating.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	dfd7f03463	tree: s/make_generating_reader_v2/make_generating_reader/ Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	c29c696780	readers: mv from_mutations_v2.hh from_mutations.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	b104862702	tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s Completely mechanical change.	2025-04-16 04:46:07 -04:00
Anna Stuchlik	0b4740f3d7	doc: add info about Scylla Doctor Automation to the docs Fixes https://github.com/scylladb/scylladb/issues/23642 Closes scylladb/scylladb#23745	2025-04-16 11:44:35 +03:00
Botond Dénes	7547d0c6a9	readers: mv from_fragments_v2.hh from_fragments.hh Completely mechanical change.	2025-04-16 04:35:00 -04:00
Botond Dénes	f1bd2553ed	readers: mv forwardable_v2.hh forwardable.hh Completely mechanical change.	2025-04-16 04:33:50 -04:00
Botond Dénes	a9d75c4f9d	readers: mv empty_v2.hh empty.hh Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	05829f98f3	tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	0e33f0d09e	readers/empty_v2.hh: replace forward declarations with include of fwd header	2025-04-16 04:12:08 -04:00
Botond Dénes	d75936d989	readers/mutation_reader_fwd.hh: forward declare reader_permit It is commonly used as parameter to reader factory methods.	2025-04-16 04:12:08 -04:00
Botond Dénes	7d9b91a00e	readers: mv delegating_v2.hh delegating.hh Completely mechanical change.	2025-04-16 04:11:55 -04:00
Botond Dénes	c7f68a2649	readers/delegating_v2.hh: move reader definition to _impl.hh file The idea behind readers/ is that each reader has its minimal header with just a factory method declaration. The delegating reader is defined in the factory header because it has a derived class in row_cache_test.cc. Move the definition to delegating_impl.hh so users not interested in deriving from it don't pay the price in header include cost.	2025-04-16 03:47:57 -04:00
Pavel Emelyanov	70ac5828a8	Update seastar submodule * seastar 099cf616...e44af9b0 (19): > Add assertion to `get_local_service` > http_client: Improve handling of server response parsing errors > util: include used header > core: Fix module linkage by using `inline constexpr` for shared constants > build: fix P2582R1 detection for GCC compiler compatibility > app-template: remove production warning > ioinfo: Extend printed data a bit more > reactor: Fix indentation after previous patch > reactor: Configure multiple mountpoints per disk > io_queue, resource, reactor: Rename dev_t -> unsigned > resource: Rename mountpoint to disk in resources > reactor: Keep queues as shared_ptr-s > io_queue: Drop device ID > io_intent: Use unsigned queue id as a key > io_queue: Keep unsigned queue id on an io_queue > file: Keep device_id on posix file impl > io_queue: Print mountpoint in latency goal bump message > io_intent: Rename qid to cid > reactor: Move engine()._num_io_groups assignment and check Changes in io-queue call for scylla-gdb update as well -- now the reactor map of device to io-queue uses seastar::shared_ptr, not std::unique_ptr. Closes scylladb/scylladb#23733	2025-04-16 09:44:37 +03:00
Botond Dénes	f5125ffa18	Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 Closes scylladb/scylladb#22779 * github.com:scylladb/scylladb: Ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-16 09:11:29 +03:00
Lakshmipathi	42ed6a87bf	test: Test truncate during topology change Add a new node; during the topology change, issue a truncate call and verify that all nodes hold no data after tablet migration. Fixes: https://github.com/scylladb/scylla-dtest/issues/5317 Signed-off-by: Lakshmipathi Ganapathi <lakshmipathi.ganapathi@scylladb.com> Closes scylladb/scylladb#22595	2025-04-16 09:10:22 +03:00
Tomasz Grabiec	001d3b2415	Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Unit test: Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Fixes https://github.com/scylladb/scylladb/issues/20073. Commit `876478b84f` was first released in scylla-6.0.0, so we might want to backport this patch accordingly. Closes scylladb/scylladb#23751 * github.com:scylladb/scylladb: storage_service: add unit test for mid-decommission transit_tablet() storage_service: preserve state of busy topology when transiting tablet	2025-04-16 00:19:24 +02:00
Pavel Emelyanov	b79137eaa4	storage_service: Use this->_features directly This dependency is already there, storage service doesn't need to go rounds via database reference to get to the features. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23739	2025-04-15 21:11:12 +03:00
Tomasz Grabiec	d493a8d736	tests: tablets: Simplify tests by moving common code to topology_builder Reduces code duplication.	2025-04-15 16:05:41 +02:00
Laszlo Ersek	841ca652a0	storage_service: add unit test for mid-decommission transit_tablet() Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 15:15:25 +02:00
Michał Chojnowski	b3d951517d	test/scylla_gdb: generate a coredump when coro_task fails This test fails sometimes, but rarely and unreliably. We want to get a coredump from it the next time it fails. Sending a SIGSEGV should induce that. Refs https://github.com/scylladb/scylladb/issues/22501 Closes scylladb/scylladb#23256	2025-04-15 15:16:38 +03:00
Calle Wilund	abd2d8a58b	test_tools: Manual merge of local key gen tool test from enterprise Fixes scylladb/scylla-enterprise#5358 Transposed tool test for local file generator, originally java test. Then enterprise test. Now here. Closes scylladb/scylladb#23726	2025-04-15 15:14:08 +03:00
Laszlo Ersek	e1186f0ae6	storage_service: preserve state of busy topology when transiting tablet Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 13:44:45 +02:00
Piotr Dulikowski	22e3b8eccd	Merge 'test/cqlpy: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek In this PR, we adjust tests in the cqlpy test suite so they only use RF-rack-valid keyspaces. After that, we enable the configuration option `rf_rack_valid_keyspaces` in the suite by default. Refs scylladb/scylladb#23428 Backport: backporting to 2025.1 so we can test the option there too. Closes scylladb/scylladb#23489 * github.com:scylladb/scylladb: test/cqlpy: Enable rf_rack_valid_keyspaces by default test: Move test_alter_tablet_keyspace_rf to cluster suite test/cqlpy: Adjust tests to RF-rack-valid keyspaces test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces	2025-04-15 12:43:11 +02:00
Avi Kivity	b4d4e48381	scylla-gdb: small-objects: fix for very small objects Because of rounding and alignment, there are multiple pools for small sizes (e.g. 4 for size 32). Because the pool selection algorithm ignores alignment, different pools can be chosen for different object sizes. For example, an object size of 29 will choose the first pool of size 32, while an object size of 32 will choose the fourth pool of size 32. The small-objects command doesn't know about this and always considers just the first pool for a given size. This causes it to miss out on sister pools. While it's possible to adjust pool selection to always choose one of the pools, it may eat a precious cycle. So instead let's compensate in the small-objects command. Instead of finding one pool for a given size, find all of them, and iterate over all those pools. Fixes #23603 Closes scylladb/scylladb#23604	2025-04-15 11:16:52 +03:00
Emil Maskovsky	3930ee8e3c	raft: fix data center remaining nodes initialization The `_remaining_nodes` attribute of the data center information was not initialized correctly. The parameter was passed by value to the initialization function instead of by reference or pointer. As a result, `_remaining_nodes` was left initialized to zero, causing an underflow when decrementing its value. This bug did not significantly impact behavior because other safeguards, such as capping the maximum voters per data center by the total number of nodes, masked the issue. However, it could lead to inefficiencies, as the remaining nodes check would not trigger correctly. Fixes: scylladb/scylladb#23702 No backport: The bug is only present in the master branch, so no backport is required. Closes scylladb/scylladb#23704	2025-04-15 09:58:32 +02:00
Nadav Har'El	fbcf77d134	raft: make group0 Raft operation timeout configurable A recent commit `370707b111` (re)introduced a timeout for every group0 Raft operation. This timeout was set to 60 seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody". However, one of the things we do as a group0 operation is schema changes, and we already noticed a few years ago, see commit `0b2cf21932`, that in some extremely overloaded test machines where tests run hundreds of times (!) slower than usual, a single big schema operation - such as Alternator's DeleteTable deleting a table and multiple of its CDC or view tables - sometimes takes more than 60 seconds. The above fix changed the client's timeout to wait for 300 seconds instead of 60 seconds, but now we also need to increase our Raft timeout, or the server can time out. We've seen this happening recently making some tests flaky in CI (issue #23543). So let's make this timeout configurable, as a new configuration option group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e., 60 seconds), the same as the existing default. The test framework overrides this default with a higher 300 second timeout, matching the client-side timeout. Before this patch, this timeout was already configurable in a strange way, using injections. But this was a misstep: We already have more than a dozen timeouts configurable through the normal configuration, and this one should have been configured in the same way. There is nothing "holy" about the default of 60 seconds we chose, and who knows maybe in the future we might need to tweak it in the field, just like we made the other timeouts tweakable. Injections cannot be used in release mode, but configuration options can. Fixes #23543 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23717	2025-04-15 10:57:39 +03:00
Kefu Chai	3e3f583b84	docs/dev/tombstone.md: fix a typo s/alwas/always/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23734	2025-04-15 10:54:42 +03:00
Avi Kivity	5e1cf90a51	build: replace tools/java submodule with packaged cassandra-stress We no longer use tools/java (scylladb/scylla-tools-java.git) for nodetool or cqlsh; only cassandra-stress. Since that is available in package form install that and excise the tools/java submodule from the source tree. pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh submodule). A few jmx references are dropped as well. Frozen toolchain regenerated. Optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#23698	2025-04-15 10:11:28 +03:00
Jenkins Promoter	9699c3ded4	Update pgo profiles - aarch64	2025-04-15 04:45:34 +03:00
Jenkins Promoter	8472aa9e53	Update pgo profiles - x86_64	2025-04-15 04:29:24 +03:00
Pavel Emelyanov	b25cb5af0c	Merge 'Use named gates' from Benny Halevy Name the gates and phased barriers we use to make it easy to debug gate_closed_exception Refs https://github.com/scylladb/seastar/pull/2688 * Enhancement only, no backport needed Closes scylladb/scylladb#23329 * github.com:scylladb/scylladb: utils: loading_cache: use named_gate utils: flush_queue: use named_gate sstables_manager: use named gate sstables_loader: use named gate utils: phased_barrier, pluggable: use named gate utils: s3::client::multipart_upload: use named gate utils: s3::client: use named_gate transport: controller: use named gate tracing: trace_keyspace_helper: use named gate task_manager: module: use named gate topology_coordinator: use named gate storage_service: use named gate storage_proxy: wait_for_hint_sync_point: use named gate storage_proxy: remote: use named gate service: session: use named gate service: raft: raft_rpc: use named gate service: raft: raft_group0: use named gate service: raft: persistent_discovery: use named gate service: raft: group0_state_machine: use named gate service: migration_manager: use named gate replica: table: use named gate replica: compaction_group, storage_group: use named gate redis: query_processor: use named gate repair: repair_meta: use named gate reader_concurrency_semaphore: use named gate raft: server_impl: use named gate querier_cache: use named gate gms: gossiper: use named gate generic_server: use named gate db: sstables_format_listener: use named gate db: snapshot: backup_task: use named gate db: snapshot_ctl: use named gate hints: hints_sender: use named gate hints: manager: use named gate hints: hint_endpoint_manager: use named gate commitlog: segment_manager: use named gate db: batchlog_manager: use named gate query_processor: remote: use named gate compaction: compaction_state: use named gate alternator/server: use named_gate	2025-04-14 20:56:32 +03:00
Sergey Zolotukhin	e05c082002	Ensure raft group0 RPCs use the gossip scheduling group Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For Raft group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This commit adds a check to ensure that the raft group0 RPCs are executed with the `gossiper` scheduling group.	2025-04-14 17:10:46 +02:00
Sergey Zolotukhin	60f1053087	Move RAFT operations verbs to GOSSIP group. In order for RAFT operations to use the gossip system semaphore, move RAFT verbs to the gossip group in `do_get_rpc_client_idx`, messaging_service. Fixes scylladb/scylladb#21637	2025-04-14 17:09:49 +02:00
Pavel Emelyanov	1bd991a111	test: Inherit sstable_assertions from sstables::test The latter class is invented to let tests access private fields of an sstable (mostly methods). The former is in fact an extended version of that which also does some checks. However, they don't inherit from each other, and the sstable_assertions partially duplicates some functionality of the test one. Add the inheritance, remove the duplicated methods from the child class, update the callers (the test class returns future<>s, the assertions one "knows" it runs in seastar thread) and mark sstable::read_toc() private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23697	2025-04-14 13:45:14 +03:00
Kefu Chai	b3f709bed7	s3: remove an extraneous space Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23714	2025-04-14 13:02:58 +03:00
Michał Chojnowski	6e2795a843	Update seastar submodule * seastar ed8952fb...099cf616 (10): > reactor: Disable hot polling if wakeup granularity is too high > smp: add shard_to_numa_node_mapping() > tests/unit/httpd_test: fix the handling of NUL bytes in the parser > fstream: skip allocation in no write_behinds case > `http`: add `xml` support to `http::mime_types::mappings` > Print incrementally in sigsegv handler > reactor: use 0x for hex addresses > tls: Make session resume key shared across credentials builders creds > build: fix CMAKE_REQUIRED_FLAGS format for sanitizer detection > reactor: Remove sched_debug() related code Closes scylladb/scylladb#23703	2025-04-14 12:54:19 +03:00
Andrei Chekun	8e33d7ab81	test.py: Make the testpy log files in pytest follow the same format Fix the incorrect log file names between conftest and scylla_manager. This regression was introduced in #22960. Currently, scylla manager will output its logs to the file with the following pattern: suite_name.path_to_the_test_file_with_subfolders.run_id.function_name.mode.run_id_cluster.log At the same time, pytest will try to find this log with the following name: suite_name.file_name_without_subfolders_path.py.run_id.function_name.mode.run_id_cluster.log This inconsistency leads to a situation where, when the test fails, the scylla manager log file is not copied to the failed_test directory and the test raises an exception on teardown. Closes scylladb/scylladb#23596	2025-04-14 12:52:48 +03:00
Evgeniy Naydanov	d6b64642c5	test.py: print out path to Scylla log for Python test suites Test suites with `type: Python` are using single Scylla node created by test.py, but it's handy to print a path to a log file in pytest log too to make it easier to find the file on failures. Closes scylladb/scylladb#23683	2025-04-14 11:15:37 +03:00
Kefu Chai	69de816b1b	scylla-gdb.py: fix a typo in gdb command description replace "runnign" with "running". Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23716	2025-04-14 10:59:21 +03:00
Benny Halevy	8d7e4d6c36	utils: loading_cache: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:09 +03:00
Benny Halevy	46f2a24772	utils: flush_queue: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:02 +03:00
Benny Halevy	d665bb4f8b	sstables_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	7969293dcf	sstables_loader: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	d3f498ae59	utils: s3::client::multipart_upload: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	eea83464c7	utils: s3::client: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:46:51 +03:00
Benny Halevy	79e967e2f5	transport: controller: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:48 +03:00
Benny Halevy	3d87b67d0e	tracing: trace_keyspace_helper: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:48 +03:00
Benny Halevy	bfdd8a98ca	task_manager: module: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:48 +03:00
Benny Halevy	5e864b6277	topology_coordinator: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:46 +03:00
Benny Halevy	a67ed59399	storage_service: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	39f1175451	storage_proxy: wait_for_hint_sync_point: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	e228a112fe	storage_proxy: remote: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	0a1e7de6ea	service: session: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	747446cb25	service: raft: raft_rpc: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	01bb3980fc	service: raft: raft_group0: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	6118150d44	service: raft: persistent_discovery: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	e430df6332	service: raft: group0_state_machine: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	5f8b5724e6	service: migration_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	7342a57cbb	replica: table: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	52e1ce7f0d	replica: compaction_group, storage_group: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	aff6017e83	redis: query_processor: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	80b5089d0c	repair: repair_meta: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	679e73053f	reader_concurrency_semaphore: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	9724d87e86	raft: server_impl: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	5780599eec	querier_cache: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	cecfb6dfd7	gms: gossiper: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	bc69bc3de7	generic_server: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	5a71763d75	db: sstables_format_listener: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	da492231df	db: snapshot: backup_task: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	edf497c170	db: snapshot_ctl: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	c5d7272393	hints: hints_sender: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	1c1adb3d60	hints: manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	4c475a1905	hints: hint_endpoint_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	bdd5a61139	commitlog: segment_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	0672c9da5c	db: batchlog_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	f8d5835cab	query_processor: remote: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	747ae5e1c4	compaction: compaction_state: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	879811e0d2	alternator/server: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Dawid Mędrek	be0877ce69	test/cqlpy: Enable rf_rack_valid_keyspaces by default All of the tests in the suite have been adjusted so they only use RF-rack-valid keyspaces, so let's start enabling the option by default.	2025-04-11 14:55:13 +02:00
Dawid Mędrek	a59842257a	test: Move test_alter_tablet_keyspace_rf to cluster suite We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the cluster test suite. The reason behind the change is that the test cannot be run with `rf_rack_valid_keyspaces` turned on in the configuration. During the test, we make the keyspace RF-rack-invalid multiple times. Since RF-rack-validity is a very strong constraint, adjusting the test otherwise is impossible. By moving it to the cluster test suite, we're able to change the configuration of the node used in the test, and so the test can work again.	2025-04-11 14:55:11 +02:00
Dawid Mędrek	958eaec056	test/cqlpy: Adjust tests to RF-rack-valid keyspaces	2025-04-11 14:55:04 +02:00
Dawid Mędrek	6bde01bb59	test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces We adjust three existing Cassandra tests so that they don't create RF-rack-invalid keyspaces. We modify the replication factor used in the problematic tests. The changes don't affect the tests as the value of the RF is unrelated to what they verify. Thanks to that, we can run them now even with enforced RF-rack-valid keyspaces. The drawback is that the modified ALTER statements do not modify the RF at all. However, since the tests seem to verify that the code responsible for VALIDATING a request works as intended, that should have little to no impact on them.	2025-04-11 14:20:14 +02:00
Dawid Mędrek	10589e966f	test/cluster/mv: Adjust test to RF-rack-valid keyspaces We adjust the test in the directory so that all of the used keyspaces are RF-rack-valid throughout their execution. Refs scylladb/scylladb#23428 Closes scylladb/scylladb#23490	2025-04-11 14:03:21 +02:00
Karol Baryła	df64985a4e	Docs: Describe driver issue with tablet RF increase Current protocol extension that sends tablet info to drivers only does that if the driver selects a non-replica coordinator for a routable request. It works well if some node on the replica list is replaced by other node, or if some replicas are removed from the list. Driver will at some point send a request to stale replica, and receive new list in response. The issue is with extending the list with new replicas. In that case old replicas are all still correct, so driver will not select any wrong replica, and will not receive the new list. As far as I know, the only scenario where this could happen is RF increase. It could be to some degree worked around in the drivers, but it would add significant complexity (definitely more than any other invalidations we introduced) while still not being an ideal solution. This scenario should be rare enough, and the consequences of not handling it minor enough (new replicas not being used as coordinators) that it does not warrant driver-side solution. Instead this commit adds info about this to documentation, advising users to restart applications after replica lists are extended. It is worth noting that if new tablet feedback protocol extension is implemented then this problem goes away. See issue #21664. Closes scylladb/scylladb#23447	2025-04-11 13:48:40 +02:00
David Garcia	cf11d5eb69	fix: openapi not rendering in docs.scylladb.com/manual Closes scylladb/scylladb#23686	2025-04-10 17:47:58 +03:00
Patryk Jędrzejczak	07a7a75b98	Merge 'raft: implement the limited voters feature' from Emil Maskovsky Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures. Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested). Tests added: * boost/group0_voter_registry_test.cc: run time on CI: ~3.5s * topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total Fixes: scylladb/scylladb#18793 No backport: This is a new feature that will not be backported. 
Closes scylladb/scylladb#21969 * https://github.com/scylladb/scylladb: raft: distribute voters by rack inside DC raft/test: fix lint warnings in `test_raft_no_quorum` raft/test: add the upgrade test for limited voters feature raft topology: handle on_up/on_down to add/remove node from voters raft: fix the indentation after the limited voters changes raft: implement the limited voters feature raft: drop the voter removal from the decommission raft/test: disable the `stop_before_becoming_raft_voter` test raft/test: stop the server less gracefully in the voters test	2025-04-10 15:29:15 +02:00
Avi Kivity	9559e53f55	Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph. To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer. Closes scylladb/scylladb#23584 * github.com:scylladb/scylladb: tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization tablet-mon.py: Center tablet id text properly in the vertical axis tablet-mon.py: Show migration stage tag in table mode only when migrating virtual-tables: Introduce system.load_per_node virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() docs: virtual-tables: Fix instructions service: tablets: Keep load_stats inside tablet_allocator	2025-04-10 14:59:08 +03:00
Avi Kivity	885838fc46	Merge 'scylla-gdb.py: improve scylla repairs command' from Botond Dénes Make output more readable by: * group follower/master repair instances separately * split repair details into one line for repair summary, then one line for each host info * add indentation to make the output easier to follow Also add `-m\|--memory` option to calculate memory usage of repair buffers. Example output: (gdb) scylla repairs -m Repairs for which this node is leader: (repair_meta) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished (repair_meta) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished (repair_meta) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, 
memory: 0}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished (repair_meta) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished Repairs for which this node is follower: Closes scylladb/scylladb#23075 * github.com:scylladb/scylladb: scylla-gdb.py: improve scylla repairs commadn scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__ scylla-gdb.py: introduce managed_bytes	2025-04-10 14:52:43 +03:00
Dani Tweig	e92740cc2b	.github: update bug_report.yml Perform a yaml "face lift" on the old bug report md template, making bug reporting more efficient. - Add dedicated textarea fields for problem description and expected behavior - Include pre-filled placeholders to guide issue reporting - Add formatted log output section with shell syntax highlighting Closes: #21532	2025-04-10 14:26:00 +03:00
Pavel Emelyanov	88318d3b50	topology_coordinator: Use shorter fault-injection overloads There are few places that want to pause until a message is received from the test. There's a convenience one-line sugar to do it. One test needs to update its expectations about the log message that appears when scylla steps on it and actually starts waiting. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23390	2025-04-10 14:05:46 +03:00
Botond Dénes	d67202972a	mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling This adaptor adapts a mutation reader pausable consumer to the frozen mutation visitor interface. The pausable consumer protocol allows the consumer to skip the remaining parts of the partition and resume the consumption with the next one. To do this, the consumer just has to return stop_iteration::yes from one of the consume() overloads for clustering elements, then return stop_iteration::no from consume_end_of_partition(). Due to a bug in the adaptor, this sequence leads to terminating the consumption completely -- so any remaining partitions are also skipped. This protocol implementation bug has user-visible effects, when the only user of the adaptor -- read repair -- happens during a query which has limitations on the amount of content in each partition. There are two such queries: select distinct ... and select ... with partition limit. When converting the repaired mutation to a query result, these queries will trigger the skip sequence in the consumer and due to the above described bug, will skip the remaining partitions in the results, omitting these from the final query result. This patch fixes the protocol bug, the return value of the underlying consumer's consume_end_of_partition() is now respected. A unit test is also added which reproduces the problem both with select distinct ... and select ... per partition limit. Follow-up work: * frozen_mutation_consumer_adaptor::on_end_of_partition() calls the underlying consumer's on_end_of_stream(), so when consuming multiple frozen mutations, the underlying's on_end_of_stream() is called for each partition. This is incorrect but benign. * Improve documentation of mutation_reader::consume_pausable(). Fixes: #20084 Closes scylladb/scylladb#23657	2025-04-10 13:19:57 +03:00
Pavel Emelyanov	4de48a9d24	encryption: Mark parts of encrypted_data_sink private Nowadays the whole class is public, but it's not in fact such. Remove the SUDDENLY unused private _flush_pos member to please the compiler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23677	2025-04-10 12:42:57 +03:00
Dawid Mędrek	0ed21d9cc1	test/cluster/test_tablets.py: Fix test erroneous indentation Some of the statements in the test are not indented properly and, as a result, are never run. It's most likely a small mistake, so let's fix it. Closes scylladb/scylladb#23659	2025-04-10 11:06:01 +03:00
Nadav Har'El	258213f73b	Merge 'Alternator batch count histograms' from Amnon Heiman This series adds a histogram for get and write batch sizes. It uses the estimated_histogram implementation which starts from 1 with 1.2 exponential factor, which works extremely tight to 20 but still covers all the way to 100. Histograms will be reported per node. Backport to 2025.1 so we'll have information about user batch size limitation Closes scylladb/scylladb#23379 * github.com:scylladb/scylladb: alternator: Add tests for the batch items histograms alternator: Add histogram for batch item count	2025-04-09 22:41:14 +03:00
Tomasz Grabiec	b5211cca85	Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk Currently, when we rebuild a tablet, we stream data from all replicas. This creates a lot of redundancy, wastes bandwidth and CPU resources. In this series, we split the streaming stage of tablet rebuild into two phases: first we stream tablet's data from only one replica and then repair the tablet. Fixes: https://github.com/scylladb/scylladb/issues/17174. Needs backport to 2025.1 to prevent out of space during streaming Closes scylladb/scylladb#23187 * github.com:scylladb/scylladb: test: add test for rebuild with repair locator: service: move to rebuild_v2 transition if cluster is upgraded locator: service: add transition to rebuild_repair stage for rebuild_v2 locator: service: add rebuild_repair tablet transition stage locator: add maybe_get_primary_replica locator: service: add rebuild_v2 tablet transition kind gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-09 21:35:37 +02:00
Avi Kivity	ed3e4f33fd	Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency. It should help to reduce cpu usage by limiting cpu concurrency for new connections. As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed. New connections_shed and connections_blocked metrics are added for tracking. Testing: - manually via simple program creating high number of connection and constantly re-connecting - added benchmark Following are benchmark results: Before: ``` > build/release/test/perf/perf_generic_server --smp=1 170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors) [...] throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54 instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96 cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15 ``` After: ``` > build/release/test/perf/perf_generic_server --smp=1 167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors) [...] throughput: mean= 171199.55 standard-deviation=2484.58 median= 171667.06 median-absolute-deviation=2087.63 maximum=173689.11 minimum=167904.76 instructions_per_op: mean= 4801.90 standard-deviation=16.54 median= 4796.78 median-absolute-deviation=9.32 maximum=4830.71 minimum=4789.81 cpu_cycles_per_op: mean= 3245.26 standard-deviation=32.28 median= 3230.44 median-absolute-deviation=16.52 maximum=3297.39 minimum=3215.62 ``` The patch adds around 67 insns/op so its effect on performance should be negligible. 
Fixes: https://github.com/scylladb/scylladb/issues/22844 Closes scylladb/scylladb#22828 * github.com:scylladb/scylladb: transport: move on_connection_close into connection destructor test: perf: make aggregated_perf_results formatting more human readable transport: add blocked and shed connection metrics generic_server: throttle and shed incoming connections according to semaphore limit generic_server: add data source and sink wrappers bookkeeping network IO generic_server: coroutinize part of server::do_accepts test: add benchmark for generic_server test: perf: add option to count multiple ops per time_parallel iteration generic_server: add semaphore for limiting new connections concurrency generic_server: add config to the constructor generic_server: add on_connection_ready handler	2025-04-09 21:41:38 +03:00
Tomasz Grabiec	5b5ada1743	tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization Per-node capacity is queried from system.load_per_node Tablet height in each node is scaled so that equal height = equal node utilization. The nominal height is assigned to the node which has the smallest capacity, so nodes with higher capacity will have smaller tablets than normal.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	217184f16b	tablet-mon.py: Center tablet id text properly in the vertical axis Was too low due to not subtracting frame size from height	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	20cac72056	tablet-mon.py: Show migration stage tag in table mode only when migrating It's the gray bar at the top of the tablet. It's not showing useful information when tablet is not migrating.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	0b9a75d7b6	virtual-tables: Introduce system.load_per_node Can be used to query per-node stats about load as seen by the load balancer. In particular, node's capacity will be used by tablet-mon.py to scale tablet columns so that equal height is equal node utilization.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	668094dc58	virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() So that population can access read's timeout and mark the permit as awaiting.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	34beaa30b5	docs: virtual-tables: Fix instructions	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	76bc11c78c	service: tablets: Keep load_stats inside tablet_allocator So that virtual tables can pick them up. It's a better place to keep them than in topology_coordinator.	2025-04-09 20:21:51 +02:00
Pavel Emelyanov	d9853efa7c	Merge '[Out-of-space prevention] db: backup: prioritize sstables that were deleted from the table' from Benny Halevy The motivation behind this change is to free up disk space as early as possible. The reason is that snapshot locks the space of all SSTables in the snapshot, and deleting from the table, for example, by compaction, or tablet migration, won't free-up their capacity until they are uploaded to object storage and deleted from the snapshot. This series adds prioritization of deleted sstables in two cases: First, after the snapshot dir is processed, the list of SSTable generation is cross-referenced with the list of SSTables presently in the table and any generation that is not in the table is prioritized to be uploaded earlier. In addition, a subscription mechanism was added to sstables_manager and it is used in backup to prioritize SSTables that get deleted from the table directory during backup. This is particularly important when backup happens during high disk utilization (e.g. 90%). Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the snapshot taken for backup. 
* Enhancement, no backport needed Closes scylladb/scylladb#23241 * github.com:scylladb/scylladb: db: snapshot: backup_task: prioritize sstables deleted during upload sstables_manager: add subscriptions db: snapshot: backup_task: limit concurrency sstables: directory_semaphore: expose get_units db: snapshot: backup_task: add sharded sstables_manager database: expose get_sstables_manager(schema) db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table db: snapshot-ctl: pass table_id to backup_task db: snapshot-ctl: expose sharded db() getter db: snapshot: backup_task: do_backup: organize components by sstable generation db: snapshot: coroutinize backup_task db: snapshot: backup_task: refactor backup_file out of uploads_worker db: snapshot: backup_task: refactor uploads_worker out of do_backup db: snapshot: backup_task: process_snapshot_dir: initialize total progress utils/s3: upload_progress: init members to 0 db: snapshot: backup_task: do_backup: refactor process_snapshot_dir db: snapshot: backup_task: keep expection as member	2025-04-09 15:32:11 +03:00
Marcin Maliszkiewicz	ce18909688	transport: move on_connection_close into connection destructor To make the code more robust by ensuring closing code is always executed.	2025-04-09 13:50:19 +02:00
Pavel Emelyanov	35dfc8c782	Merge 'audit: add semaphore to audit_syslog_storage_helper' from Andrzej Jackowski audit_syslog_storage_helper::syslog_send_helper uses Seastar's net::datagram_channel to write to syslog device (usually /dev/log). However, datagram_channel.send() is not fiber-safe (ref seastar#2690), so unserialized use of send() results in packets overwriting its state. This, in turn, causes a corruption of audit logs, as well as assertion failures. To work around the problem, a new semaphore is introduced in audit_syslog_storage_helper. As storage_helper is a member of sharded audit service, the semaphore allows for one datagram_channel.send() on each shard. Each audit_syslog_storage_helper stores its own datagram_channel, therefore concurrent sends to datagram_channel are eliminated. This change: - Moved syslog_send_helper to audit_syslog_storage_helper - Coroutinize audit_syslog_storage_helper - Introduce semaphore with count=1 in audit_syslog_storage_helper. See https://github.com/scylladb/scylla-dtest/pull/5749 for related dtest Fixes: scylladb#22973 Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1. Closes scylladb/scylladb#23464 * github.com:scylladb/scylladb: audit: add semaphore to audit_syslog_storage_helper audit: corutinize audit_syslog_storage_helper audit: moved syslog_send_helper to audit_syslog_storage_helper	2025-04-09 12:39:06 +03:00
Marcin Maliszkiewicz	619944555f	test: perf: make aggregated_perf_results formatting more human readable Before: throughput: mean=170728.58 standard-deviation=1921.76 median=171084.16 median-absolute-deviation=1501.58 maximum=172913.36 minimum=167288.97 instructions_per_op: mean=4685.89 standard-deviation=12.46 median=4683.92 median-absolute-deviation=9.68 maximum=4706.53 minimum=4666.70 cpu_cycles_per_op: mean=3090.94 standard-deviation=52.69 median=3103.43 median-absolute-deviation=24.55 maximum=3192.99 minimum=3003.00 After: throughput: mean= 168224.81 standard-deviation=854.48 median= 168829.02 median-absolute-deviation=604.21 maximum=168829.02 minimum=167620.60 instructions_per_op: mean= 4837.02 standard-deviation=20.89 median= 4851.79 median-absolute-deviation=14.77 maximum=4851.79 minimum=4822.24 cpu_cycles_per_op: mean= 3271.42 standard-deviation=46.29 median= 3304.16 median-absolute-deviation=32.73 maximum=3304.16 minimum=3238.69	2025-04-09 10:49:20 +02:00
Marcin Maliszkiewicz	599f4d312b	transport: add blocked and shed connection metrics This adds some visibility into connection storm mitigations added in following commits.	2025-04-09 10:49:18 +02:00
Marcin Maliszkiewicz	26518704ab	generic_server: throttle and shed incoming connections according to semaphore limit If we have uninitialized_connections_semaphore_cpu_concurrency (default 2) connections being processed we start delaying accepting new connections. Connections which are in network IO state are not counted towards this limit and they can go to cpu phase without blocking. So it can happen that we process more concurrent new connections but that's a necessary tradeoff to make progress during storm without implementing more advanced machinery (i.e. priority queue).	2025-04-09 10:48:51 +02:00
Marcin Maliszkiewicz	9f5de2c256	generic_server: add data source and sink wrappers bookkeeping network IO They release semaphore units when we start network IO and acquire them when we enter cpu intensive phase. We use consume() so it doesn't block because we don't want connections we started processing to compete with new incoming connections. Otherwise during connection storm we wouldn't make much progress. There will be a simplification here as we'll treat disk IO (if there is any) as cpu work.	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	c56116372e	generic_server: coroutinize part of server::do_accepts	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	719d04d501	test: add benchmark for generic_server Changes in configure.py are needed because we don't want to embed this benchmark in scylla binary as perf_simple_query or perf_alternator, it doesn't directly translate to Scylla performance but we want to use aggregated_perf_results for precise cpu measurements so we need different dependencies.	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	b957cedace	test: perf: add option to count multiple ops per time_parallel iteration	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	ed82bede39	generic_server: add semaphore for limiting new connections concurrency It will be used in following commits.	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	33122d3f93	generic_server: add config to the constructor	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	474e84199c	generic_server: add on_connection_ready handler This patch cleans the code a bit so that ready state is set in a single place. And adds handler which will allow adding logic when connection is made ready, this will be added in the following commits.	2025-04-09 10:30:58 +02:00
Benny Halevy	1ab3ec061b	db: snapshot: backup_task: prioritize sstables deleted during upload subscribe on each shard's sstables_manager to get callback notifications and keep the generation numbers of deleted sstables in a vector so they can be prioritized first to free up their disk space as soon as possible. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	d8b0c661e4	sstables_manager: add subscriptions Allow other submodules to subscribe for added/deleted notifications. This will be used in a later patch to prioritize unlinked sstables for backup. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	d3b4874ec3	db: snapshot: backup_task: limit concurrency Otherwise, once all the background tasks are created we have no way to reorder the queue. Fixes #23239 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	e60fcc58b7	sstables: directory_semaphore: expose get_units To be used by a following patch for backup concurrency control. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	b7807ec165	db: snapshot: backup_task: add sharded sstables_manager Get a reference to the table's sstables_manager on each shard. This will be used by later patches to limit concurrency and to subscribe for notifications. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	b270d552fb	database: expose get_sstables_manager(schema) Return either the system or user sstables manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	9a4b4afade	db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table Detect SSTables that are already deleted from the table in process_snapshot_dir when their number_of_links is equal to 1. Note that the SSTable may be hard-linked by more than one snapshot, so even after it is deleted from the table, its number of links would be greater than one. In that case, however, uploading it earlier won't help to free-up its capacity since it is still held by other snapshots. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	4b8699e278	db: snapshot-ctl: pass table_id to backup_task To be used by the following patches to get to the table's sstables_manager for concurrency control and for notifications (TBD). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	d646603bfd	db: snapshot-ctl: expose sharded db() getter Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	63bc1d4626	db: snapshot: backup_task: do_backup: organize components by sstable generation Do not rely on the snapshot directory listing order. This will become useful for prioritizing unlinked sstables in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:06 +03:00
Benny Halevy	a731c1b33d	db: snapshot: coroutinize backup_task Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:53 +03:00
Benny Halevy	189075b885	db: snapshot: backup_task: refactor backup_file out of uploads_worker Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:53 +03:00
Benny Halevy	e3ba425c2b	db: snapshot: backup_task: refactor uploads_worker out of do_backup Let do_backup deal only with the high level coordination. A future patch will follow this structure to run uploads_worker on each shard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:53 +03:00
Benny Halevy	ff25b4c97f	db: snapshot: backup_task: process_snapshot_dir: initialize total progress Now we can calculate in advance how much data we intend to upload before we start uploading it. This will also be used later when uploading in parallel on all shards, so we can collect the progress from all shards in get_progress(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:51 +03:00
Benny Halevy	6da215e8af	utils/s3: upload_progress: init members to 0 For default construction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:44:52 +03:00
Benny Halevy	70307e8120	db: snapshot: backup_task: do_backup: refactor process_snapshot_dir Do preliminary listing of the snapshot dir. While at it, simplify the loop as follows: The optional directory_entry returned by snapshot_dir_lister.get() can be checked as part of the loop condition expression, and with that, error handling can be simplified and moved out of the loop body. A followup patch will organize the component files by their sstable generation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> db: snapshot: backup_task: process_snapshot_dir: simplify loop Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:44:52 +03:00
Benny Halevy	8a4b6b9614	db: snapshot: backup_task: keep exception as member As part of refactoring do_backup(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:44:52 +03:00
Botond Dénes	b65a76ab6f	Merge 'nodetool: cluster repair: add a command to repair tablet keyspaces' from Aleksandra Martyniuk Add a new nodetool cluster super-command. Add nodetool cluster repair command to repair tablet keyspaces. It uses the new /storage_service/tablets/repair API. The nodetool cluster repair command allows you to specify the keyspace and tables to be repaired. A cluster repair of many tables will request /storage_service/tablets/repair and wait for the result synchronously for each table. The nodetool repair command, which was previously used to repair keyspaces of any type, now repairs only vnode keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/22409. Needs backport to 2025.1 that introduces the new tablet repair API Closes scylladb/scylladb#22905 * github.com:scylladb/scylladb: docs: nodetool: update repair and add tablet-repair docs test: nodetool: add tests for cluster repair command nodetool: add cluster repair command nodetool: repair: extract getting hosts and dcs to functions nodetool: repair: warn about repairing tablet keyspaces nodetool: repair: move keyspace_uses_tablets function	2025-04-09 08:20:34 +03:00
Botond Dénes	5f697d373f	test/cqlpy/test_tools.py: use AIO backend in scylla-sstable query tests These tests seem to be hitting the io-uring bug in the kernel from time-to-time, making CI flaky. Force the use of the AIO backend in these tests, as a workaround until fixed kernels (>=6.8.13) are available. Fixes: #23517 Fixes: #23546 Closes scylladb/scylladb#23648	2025-04-08 20:29:58 +03:00
Benny Halevy	dfdca2d84e	locator: topology: drop unused calculate_datacenters Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23647	2025-04-08 19:04:56 +03:00
Tomasz Grabiec	06b49bdf69	Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live release. Closes scylladb/scylladb#23255 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-08 17:26:58 +02:00
Andrzej Jackowski	c12f976389	audit: add semaphore to audit_syslog_storage_helper audit_syslog_storage_helper::syslog_send_helper uses Seastar's net::datagram_channel to write to syslog device (usually /dev/log). However, datagram_channel.send() is not fiber-safe (ref seastar#2690), so unserialized use of send() results in packets overwriting its state. This, in turn, causes a corruption of audit logs, as well as assertion failures. To work around the problem, a new semaphore is introduced in audit_syslog_storage_helper. As storage_helper is a member of sharded audit service, the semaphore allows for one datagram_channel.send() on each shard. Each audit_syslog_storage_helper stores its own datagram_channel, therefore concurrent sends to datagram_channel are eliminated. This change: - Introduce semaphore with count=1 in audit_syslog_storage_helper. - Add 1 hour timeout to the semaphore, so semaphore stalls are failed just as all other syslog auditing failures. Fixes: scylladb#22973	2025-04-08 16:24:42 +02:00
Andrzej Jackowski	889fd5bc9f	audit: coroutinize audit_syslog_storage_helper This change: - Coroutinize audit_syslog_storage_helper::syslog_send_helper - Coroutinize audit_syslog_storage_helper::start - Coroutinize audit_syslog_storage_helper::write	2025-04-08 16:24:42 +02:00
Andrzej Jackowski	dbd2acd2be	audit: moved syslog_send_helper to audit_syslog_storage_helper This change: - Make syslog_send_helper() a method of audit_syslog_storage_helper, so syslog_send_helper() can access private members of audit_syslog_storage_helper in the next commits. - Remove unneeded syslog_send_helper() arguments that now are class members.	2025-04-08 16:24:42 +02:00
Benny Halevy	f702adf6a5	main: fix typo in tablet allocator checkpoint message Introduced in `b6705ad48b` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23211	2025-04-08 17:19:41 +03:00
Botond Dénes	583a813d17	docs/dev/tombstone.md: fix link to ddl.html Closes scylladb/scylladb#23622	2025-04-08 16:18:50 +03:00
Anna Stuchlik	93a7b3ac1d	doc: add enabling consistent topology updates to the 2025.1 upgrade guide-from-2024 This commit adds the procedure to enable consistent topology updates for upgrades from 2024.1 to 2025.1 (or from 2024.2 to 2025.1 if the feature wasn't enabled after upgrading from 2024.1 to 2024.2). Fixes https://github.com/scylladb/scylladb/issues/23650 Closes scylladb/scylladb#23651	2025-04-08 15:38:00 +03:00
Robert Bindar	4e3eb2fdac	Move direct_failure_detector from root to service/ direct_failure_detector used to be used by gms/ as well, but that's not the case anymore, so raft/ is the only user. Fixes #23133 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23248	2025-04-08 13:03:24 +03:00
Aleksandra Martyniuk	372b562f5e	test: add test for rebuild with repair	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	acd32b24d3	locator: service: move to rebuild_v2 transition if cluster is upgraded If cluster is upgraded to version containing rebuild_v2 transition kind, move to this transition kind instead of rebuild.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	eb17af6143	locator: service: add transition to rebuild_repair stage for rebuild_v2 Modify write_both_read_old and streaming stages in rebuild_v2 transition kind: write_both_read_old moves to rebuild_repair stage and streaming stage streams data only from one replica.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	4a847df55c	locator: service: add rebuild_repair tablet transition stage Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. rebuild_repair is a stage that will be used to perform the repair phase. It executes the tablet repair on tablet_info::replicas. A primary replica out of migration_streaming_info::read_from is the repair master. If the repair succeeds, we move to streaming tablet transition stage, and to cleanup_target - if it fails. The repair bypasses the tablet repair scheduler and it does not update the repair_time. A transition to the rebuild_repair stage will be added in the following patches.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	5d6041617b	locator: add maybe_get_primary_replica Add maybe_get_primary_replica to choose a primary replica out of custom replica set.	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	ed7b8bb787	locator: service: add rebuild_v2 tablet transition kind Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. To differentiate the two streaming methods, a new tablet transition kind - rebuild_v2 - is added. The transitions and stages for rebuild_v2 transition kind will be added in the following patches.	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	b80e957a40	gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	9769d7a564	docs: nodetool: update repair and add tablet-repair docs	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	02fb71da42	test: nodetool: add tests for cluster repair command	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	8bbc5e8923	nodetool: add cluster repair command Add a new nodetool cluster repair command that repairs tablet keyspaces. Users may specify keyspace and tables that they want to repair. If the keyspace and tables are not specified, all tablet keyspaces are repaired. The command calls the new tablet repair API /storage_service/tablets/repair.	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	aa3973c850	nodetool: repair: extract getting hosts and dcs to functions	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	b81c81c7f4	nodetool: repair: warn about repairing tablet keyspaces Warn about an attempt to repair tablet keyspace with nodetool repair. A nodetool cluster repair command to repair tablet keyspaces will be added in the following patches.	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	cbde835792	nodetool: repair: move keyspace_uses_tablets function	2025-04-08 09:13:14 +02:00
Yaron Kaikov	2dc7ea366b	.github: Make "make-pr-ready-for-review" workflow run in base repo In `57683c1a50` we fixed the `token` error, but removed the checkout part, which now causes the following error: ``` failed to run git: fatal: not a git repository (or any of the parent directories): .git ``` Adding the repo checkout stage to avoid this error. Fixes: https://github.com/scylladb/scylladb/issues/22765 Closes scylladb/scylladb#23641	2025-04-08 09:30:18 +03:00
Raphael S. Carvalho	0f59deffaa	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). 
So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560	2025-04-08 07:32:58 +03:00
Botond Dénes	0d39091df2	test/boost/row_cache_test: add memtable overlap check tests Similar to test/cluster/test_data_resurrection_in_memtable.py but works on a single node and uses more low-level mechanism. These tests can also reproduce more advanced scenarios, like concurrent reads, with some reading from flushed memtables.	2025-04-08 00:11:36 -04:00
Botond Dénes	6c1f6427b3	replica/table: add error injection to memtable post-flush phase After the memtable was flushed to disk, but before it is merged to cache. The injection point will only be active for the table specified in the "table_name" injection parameter.	2025-04-08 00:11:36 -04:00
Botond Dénes	f7938e3f8b	utils/error_injection: add a way to set parameters from error injection points With this, now it is possible to have two-way communication between the error injection point and its enabler. The test can enable the error injection point, then wait until it is hit, before proceeding.	2025-04-08 00:11:36 -04:00
Botond Dénes	34b18d7ef4	test/cluster: add test_data_resurrection_in_memtable.py Reproducers for #23252 and #23291 -- cache garbage collecting tombstones resurrecting data in the memtable.	2025-04-08 00:11:36 -04:00
Botond Dénes	e5afd9b5fb	test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts Such that a given index in the return hosts refers to the same underlying Scylla instance, as the same index in the passed-in nodes list. This is what users of this method intuitively expect, but currently the returned hosts list is unordered (has random order).	2025-04-08 00:11:36 -04:00
Botond Dénes	df09b3f970	replica/mutation_dump: don't assume cells are live Currently the dumper unconditionally extracts the value of atomic cells, assuming they are live. This doesn't always hold of course and attempting to get the value of a dead cell will lead to marshalling errors. Fix by checking is_live() before attempting to get the cell value. Fix for both regular and collection cells.	2025-04-08 00:11:36 -04:00
Botond Dénes	cb76cafb60	replica/database: do_apply() add error injection point So writes (to user tables) can be failed on a replica, via error injection. Should simplify tests which want to create differences in what writes different replicas receive.	2025-04-08 00:11:35 -04:00
Botond Dénes	d126ea09ba	replica: improve memtable overlap checks for the cache The current memtable overlap check that is used by the cache -- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only checks the active memtable, so memtables which are either being flushed or are already flushed and also have active reads against them do not participate in the overlap check. This can result in temporary data resurrection, where a cache read can garbage-collect a tombstone which still covers data in a flushing or flushed memtable, which still have active read against it. To prevent this, extend the overlap check to also consider all of the memtable list. Furthermore, memtable_list::erase() now places the removed (flushed) memtable in an intrusive list. These entries are alive only as long as there are readers still keeping an `lw_shared_ptr<memtable>` alive. This list is now also consulted on overlap checks.	2025-04-08 00:11:35 -04:00
Botond Dénes	7e600a0747	replica/memtable: add is_merging_to_cache() And set it when the memtable is merged to cache.	2025-04-08 00:11:35 -04:00
Botond Dénes	6b5b563ef7	db/row_cache: add overlap-check for cache tombstone garbage collection The cache should not garbage-collect tombstones which cover data in the memtable. Add overlap checks (get_max_purgeable) to garbage collection to detect tombstones which cover data in the memtable and to prevent their garbage collection.	2025-04-08 00:11:35 -04:00
Botond Dénes	c2518cdf1a	mutation/mutation_compactor: copy key passed-in to consume_new_partition() This doesn't introduce additional work for single-partition queries: the key is copied anyway on consume_end_of_stream(). Multi-partition reads and compaction are not that sensitive to the additional copy. This change fixes a bug in the compacting_reader: currently the reader passes _last_uncompacted_partition_start.key() to the compactor's consume_new_partition(). When the compactor emits enough content for this partition, _last_uncompacted_partition_start is moved from to emit the partition start; this makes the key reference passed to the compactor corrupt (it refers to a moved-from value). This in turn means that subsequent GC checks done by the compactor will be done with a corrupt key and therefore can result in tombstones being garbage-collected while they still cover data elsewhere (data resurrection). The compacting reader is violating the API contract and normally the bug should be fixed there. We make an exception here because doing the fix in the mutation compactor better aligns with our future plans: * The fix simplifies the compactor (gets rid of _last_dk). * Prepares the way to get rid of the consume API used by the compactor.	2025-04-08 00:11:35 -04:00
Avi Kivity	8d2a41db82	Merge "Fixes for gossiper conversion to host id" from Gleb " The series contains fixes to gossiper conversion to host id. There are two fixes where we could erroneously send outdated entry in a gossiper message and a fix for force_remove_endpoint which was not converted to work on host id and this caused it to not delete the entry in some cases (in replace with the same ip case). " * 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev: gossiper: send newest entry in a digest message gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter gossiper: move force_remove_endpoint to work on host id gossiper: do not send outdated endpoint in gossiper round	2025-04-07 17:04:28 +03:00
Michał Chojnowski	827d774241	test_sstable_compression_dictionaries: reproduce an internal error in debug logging Extend one of the tests so that it reproduces #23624, by creating a situation where no-compression SSTables are handled with debug logging enabled.	2025-04-07 13:05:04 +02:00
Michał Chojnowski	056da4b326	compress: fix an internal error when a specific debug log is enabled While iterating over the recent `69684e16d8` series, I shot myself in the foot by defining `algorithm_to_name(algorithm::none)` to be an internal error, and later calling that anyway in a debug log. (Tests didn't catch it because there's no test which simultaneously enables the debug log and configures some table to have no compression). This proves that `algorithm_to_name` is too much of a footgun. Fix it so that calling `algorithm_to_name(algorithm::none)` is legal. In hindsight, I should have done that immediately.	2025-04-07 13:05:03 +02:00
dependabot[bot]	a899cae158	build(deps): bump sphinx-scylladb-theme from 1.8.5 to 1.8.6 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.5 to 1.8.6. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.5...1.8.6) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.6 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#23537	2025-04-07 13:42:19 +03:00
Emil Maskovsky	76ceaf129b	raft: distribute voters by rack inside DC Distribute the voters evenly across racks in the datacenters. When distributing the voters across datacenters, the datacenters with more racks will be preferred in case of a tie. Also, in case of asymmetric voter distribution (2 DCs), the DC with more racks will have more voters (if the node counts allow it). In case of a single datacenter, the voters will be distributed across racks evenly (in the similar manner as done for the whole datacenters). The intention is that similar to losing a datacenter, we want to avoid losing the majority if a rack goes down - so if there are multiple racks, we want to distribute the voters across them in such a way that losing the whole rack will not cause the majority loss (if possible).	2025-04-07 12:31:37 +02:00
Emil Maskovsky	831fae4bff	raft/test: fix lint warnings in `test_raft_no_quorum` Code cleanup - fixed lint warnings in `test_raft_no_quorum` test.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	92f6662cd1	raft/test: add the upgrade test for limited voters feature We test the upgrade scenario of the limited voters feature - first we start the cluster with the limited voters feature disabled ("old code"), then we upgrade the cluster to the version with the limited voters feature enabled ("new code"). The nodes are being upgraded one by one and we test that the cluster still works (doesn't e.g. lose the majority).	2025-04-07 12:31:37 +02:00
Emil Maskovsky	a740623fa1	raft topology: handle on_up/on_down to add/remove node from voters Adding and removing the voters based on the node up/down events. This improves the availability of the system by automatically adjusting the number of voters in the system to use the alive nodes in precedence. We can then also drop the voter removal from the `write_both_read_old` to further simplify the code - the node will be removed from the voters when it goes down. However we only can do that in case the feature is enabled.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	dc6afd47b7	raft: fix the indentation after the limited voters changes Fix the indentation that needs to be changed because of the added condition. This is done separately to make it easier to review the main commit with the functional changes.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	1d06ea3a5a	raft: implement the limited voters feature Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of datacenters (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose the majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. Currently the voter limits will not be configurable (we might introduce configurable limits later if that would be needed/requested). The feature is enabled by the `group0_limited_voters` feature flag to avoid issues with cluster upgrade (the feature will be only enabled once all nodes in the cluster are upgraded to the version supporting the feature). Fixes: scylladb/scylladb#18793	2025-04-07 12:31:18 +02:00
Lakshmi Narayanan Sreethar	750f4baf44	replica/table::do_apply : do not check for async gate's closure The `table::do_apply()` method verifies if the compaction group's async gate is open to determine if the compaction group is active. Closing this async gate prevents any new operations but waits for existing holders to exit, allowing their operations to complete. When holding a gate, holders will observe the gate as closed when it is being closed, but this is irrelevant as they are already inside the gate and are allowed to complete. All the callers of `table::do_apply()` already enter the gate before calling the method. So, the async gate check inside `table::do_apply()` will erroneously throw an exception when the compaction group is closing despite holding the gate. This commit removes the check to prevent this from happening. Fixes #23348 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#23579	2025-04-07 13:27:22 +03:00
Emil Maskovsky	8b186ab0ff	raft: drop the voter removal from the decommission In the particular case of node decommission, this code doesn't really matter in production and only confuses us. Losing majority is an extremely rare event, and for this code to help one would have to lose majority in a very specific way (exactly half of the nodes die in a short time window during decommission), which is unrealistic. In addition, this code will be completely irrelevant (and would never be executed) once we implement #23266. Refs: scylladb/scylladb#23266	2025-04-07 12:23:25 +02:00
Emil Maskovsky	00794af94d	raft/test: disable the `stop_before_becoming_raft_voter` test The workflow of becoming a voter changes with the "limited voters" feature, as the node will no longer become a voter on its own, but the votership is being managed by the topology coordinator. This therefore breaks the `stop_before_becoming_raft_voter` test, as that injection relies on the old behavior. We will disable the test for this particular case for now and address either fixing or complete removal of the test in a follow-up task. Refs: scylladb/scylladb#23418	2025-04-07 12:23:25 +02:00
Emil Maskovsky	57df5d013e	raft/test: stop the server less gracefully in the voters test Stopping the test gracefully might hide some issues, therefore we want to stop it forcefully to make sure that the code can handle it. Added a parameter to stop gracefully or less gracefully (so that we test both cases).	2025-04-07 12:22:19 +02:00
Pavel Emelyanov	10376b5b85	db: Re-use database::snapshot_table_on_all_shards() There are two snapshot-on-all-shards methods on the database -- the one that snapshots a keyspace and the one that snapshots a vector of tables. The latter snapshots a single table with a neat helper, while the former has the helper open-coded. Re-using the helper in keyspace snapshot is worth it, but needs to patch the helper to work on uuid, rather than ks:cf pair of strings. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23532	2025-04-07 11:55:43 +02:00
Nadav Har'El	84fd52315f	alternator: in GetRecords, enforce Limit to be <= 1000 Alternator Streams' "GetRecords" operation has a "Limit" parameter on how many records to return. The DynamoDB documentation says that the upper limit on this Limit parameter is 1000 - but Alternator didn't enforce this. In this patch we begin enforcing this highest Limit, and also add a test for verifying this enforcement. As usual, the new test passes on DynamoDB, and after this patch - also on Alternator. The reason why it's useful to have some upper limit on Limit is that the existing executor::get_records() implementation does not really have preemption points in all the necessary places. In particular, we have a loop on all returned records without preemption points. We also store the returned records in a RapidJson vector, which requires a contiguous allocation. Even before this patch, GetRecords had a hard limit of 1 MB of results. But still, in some cases 1 MB of results may be a lot of results, and we can see stalls in the aforementioned places being O(number of results). Fixes #23534 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23547	2025-04-07 12:52:03 +03:00
Kefu Chai	55777812d4	s3/client: Optimize file streaming with zero-copy multipart uploads When streaming files using multipart upload, switch from using `output_stream::write(const char*, size_t)` to passing buffer objects directly to `output_stream::write()`. This eliminates unnecessary memory copying that occurred when the original implementation had to defensively copy data before sending. The buffer objects can now be safely reused by the output stream instead of creating deep copies, which should improve performance by reducing memory operations during S3 file uploads. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23567	2025-04-07 12:50:06 +03:00
Avi Kivity	ac3d25eb44	sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables The incremental reader selector maintains an unordered_set of sstables that are already engaged, and uses std::views::filter to filter those out. It adds the sstable under consideration to the set, and if addition failed (because it's already in) then it filters it out. This breaks if the filter view is executed twice - the first pass will add every sstable to the set, and the second will consider every sstable already filtered. This is what happens with libstdc++ 15 (due to the addition of vector(from_range_t) constructor), which uses the first pass to calculate the vector size and the second pass to insert the elements into a correctly-sized vector. Fix by open-coding the loop. Closes scylladb/scylladb#23597	2025-04-07 12:49:04 +03:00
Gleb Natapov	a982db326e	gossiper: send newest entry in a digest message In cases where two entries have the same ip address, send information only for the newest one. Currently we send both, which makes the receiver use one of them at random, and it may be the outdated one (though that should only cause more data than needed to be requested).	2025-04-06 18:39:24 +03:00
Gleb Natapov	8d534ee68e	gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter	2025-04-06 18:39:24 +03:00
Gleb Natapov	6f53611337	gossiper: move force_remove_endpoint to work on host id Since the gossiper works on host ids now, it is incorrect to leave this function to work on ip. It makes it impossible to delete an outdated entry since the "gossiper.get_host_id(endpoint) != id" check will always be false for such entries (get_host_id() always returns the most up-to-date mapping).	2025-04-06 18:39:24 +03:00
Amnon Heiman	b55f24c14d	alternator: Add tests for the batch items histograms This patch adds a test for the batch‑items histogram for both get and write operations. It updates the check_increases_metric_exact helper function so that it gets a list of expected values and labels (labels can be None). This makes it easy to test multiple buckets in a histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-04-06 18:22:23 +03:00
Amnon Heiman	c060c0b867	alternator: Add histogram for batch item count This patch adds an estimated_histogram for alternator batch item count. estimated_histogram can be used with values starting from 1 with an exponential factor of 1.2, which nicely covers values up to 20, but with only 22 buckets it can reach all the way to 100 (plus infinity). Aside from the new histograms for get and write batches, a helper function was added to return the histogram in the metric format without changing its resolution (which is the metric’s default behaviour). The histogram will be reported once per node rather than once per shard. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-04-06 18:22:13 +03:00
Marcin Maliszkiewicz	b94acfb37b	test: remove alternator code from perf-simple-query This kind of benchmark was superseded by perf-alternator which has more options, workflows and most importantly measures overhead of http server layer (including json parsing). There is no need to maintain additional code in perf-simple-query. Closes scylladb/scylladb#23474	2025-04-06 18:15:16 +03:00
Pavel Emelyanov	d4f3a3ee4f	cql: Remove unused "initial_tablets" mention from guardrails All tablets configuration was moved into its own "with tablets" section, this option name cannot be met among replication factors. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23555	2025-04-06 16:52:07 +03:00
Gleb Natapov	df6cd87bcc	gossiper: do not send outdated endpoint in gossiper round Now that the gossiper map is id based, there can be a situation where two entries have the same ip. The shadow round should send the newest one in this case. The patch makes it so. Fixes: #23553	2025-04-06 15:08:03 +03:00
Nadav Har'El	431de48df9	test/alternator: test for item with many attributes A user complained that he couldn't read or write an item with more than 16 attributes (!) in Alternator. This isn't true, but I realized that we don't have a simple test for this case - all tests use just a few attributes. So let's add such a test, doing PutItem, UpdateItem and GetItem with 400 attributes. Unsurprisingly, the test passes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23568	2025-04-03 22:35:49 +03:00
Nadav Har'El	a9a6f9eecc	test/alternator: increase timeout in Alternator RBAC test On our testing infrastructure, tests often run a hundred times (!) slower than usual, for various reasons that we can't always avoid. This is why all our test frameworks drastically increase the default timeouts. We forgot to increase the timeout in one place - where Alternator tests use CQL. This is needed for the Alternator role-based access control (RBAC) tests, which are configured via CQL and therefore the Alternator test unusually uses CQL. So in this patch we increase the timeout of the CQL driver used by Alternator tests to the same high timeouts (60-120 seconds) used by the regular CQL tests. As the famous saying goes, these timeouts should be enough for anyone. Fixes #23569. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23578	2025-04-03 22:31:08 +03:00
Benny Halevy	cdf9fe9e50	Update seastar submodule * seastar 2f13c461...ed8952fb (24): > file: explain dsync check in flush method > gate: add named_gate > tests: unit: add gate_test > reactor: Remove global task_quota extern declaration > future: Move report_failed_future to internal namespace > update boost cooking URL > smp: prefault: clear memory map after threads join > change format to sesatar::format > Prevent move / copy constructor / assignment on backtrace_buffer > Remove unnecesary flush calls from backtrace_buffer usage points > Make backtrace_buffer flush on destruction > Add `backtrace_buffer&` param to maybe_report_kernel_trace function > Prevent empty kernel callstack messages > Make cpu_stall_detector_linux_perf_event::maybe_report_kernel_trace function protected. > iotune: Add cli flag to force io depth > smp: prefault: decouple _stop_request from join_threads > reactor: more info, robustness on segfault > net/udp: fix ipv4_udp::next_port calculation > map_reduce: prevent mapper or reducer exception from poisoning state > build: Re-enable ASan's verify_asan_link_order check > tests: enable/disable internet-dependent tests at runtime > test: tls_test: rename test_simple_x509_client variants to avoid naming conflicts > tests: extend test.py to accept arbitrary ctest parameters from positional args > tests: add a handle for building tests in "offline" mode Closes scylladb/scylladb#23566	2025-04-03 19:45:37 +03:00
Botond Dénes	1198213000	Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378 Closes scylladb/scylladb#23478 * github.com:scylladb/scylladb: tablets: Make tablet allocation equalize per-shard load tablets: load_balancer: Fix reporting of total load per node	2025-04-03 16:32:53 +03:00
Botond Dénes	fcdae20fd1	Merge 'Add tablet enforcing option' from Benny Halevy This series adds a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing `enable_tablets` option. It can be set to the following values: disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option enabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option `tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether tablets are disabled or enabled by default for new keyspaces, respectively. In either case, tablets can be opted-in or out using the `tablets={'enabled':...}` keyspace option, when the keyspace is created. `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow opting out when creating new keyspaces by setting `tablets = {'enabled': false}` Refs scylladb/scylla-enterprise#4355 * Requires backport to 2025.1 Closes scylladb/scylladb#22273 * github.com:scylladb/scylladb: boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option db/config: add tablets_mode_for_new_keyspaces option	2025-04-03 16:32:19 +03:00
Kefu Chai	3760a1c85e	cql3: Remove unnecessary 'virtual' specifiers from final class methods Remove 'virtual' specifiers from member functions in final classes where they can never be overridden. This addresses Clang errors like: ``` /home/kefu/dev/scylladb/cql3/column_identifier.hh:85:21: error: virtual method 'to_string' is inside a 'final' class and can never be overridden [-Werror,-Wunnecessary-virtual-specifier] 85 \| virtual sstring to_string() const; \| ^ 1 error generated. ``` This change improves code clarity and maintainability by eliminating redundant modifiers that could cause confusion. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23570	2025-04-03 13:51:42 +03:00
Tomasz Grabiec	fe8187e594	Merge 'repair: release erm in repair_writer_impl::create_writer when possible' from Aleksandra Martyniuk Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed. Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked. Fixes: #23453. Needs backport to 2025.1 that introduces the tablet repair scheduler. Closes scylladb/scylladb#23455 * github.com:scylladb/scylladb: test: add test to check concurrent migration and repair of two different tablets repair: release erm in repair_writer_impl::create_writer when possible	2025-04-03 11:15:08 +02:00
Botond Dénes	7bbfa5293f	test/cluster/test_read_repair.py: increase read request timeout This test enables trace-level logging for the mutation_data logger, which seems to be too much in debug mode and the test read times out. Increase timeout to 1 minute to avoid this. Fixes: #23513 Closes scylladb/scylladb#23558	2025-04-03 10:42:11 +03:00
Botond Dénes	07510c07a0	readers/mutation_readers: queue_reader_handle_v2::push_end_of_stream() raise _ex if set Instead of raising std::runtime_error("Dangling queue_reader_handle_v2") unconditionally. push() already raises _ex if set, best to be consistent. Unconditionally raising std::runtime_error can cause an error to be logged, when aborting an operation involving a queue reader. Although the original exception passed to queue_reader_handle_v2::abort() is most likely handled by higher level code (not logged), the generic std::runtime_error raised is not and therefore is logged. Fixes: #23550 Closes scylladb/scylladb#23554	2025-04-03 10:39:56 +03:00
Pavel Emelyanov	3bf4768205	Merge 'Unify http transport in EAR to use seastar http client' from Calle Wilund Fixes #22925 Refs #22885 Some providers in EAR were written before seastar got its own native http connector (as it is). Thus hand-made connectivity is used there. This PR unifies the code paths, and also extracts some abstraction between providers where possible. One big reason for this is the handling of abrupt disconnects and retries; Seastar has some handling of things like EPIPE and ECONNRESET situations, that can be safely ignored in a REST call iff data was in fact transferred etc. This PR mainly takes the usage of seastar httpclient from gcp connector, makes a wrapper matching most of the usage of local client in kms connector, ensures common functionality and then replaces the code in the individual connectors. Closes scylladb/scylladb#22926 * github.com:scylladb/scylladb: encryption::gcp: Use seastar http client wrapper encryption::kms: Drop local http client and use seastar wrapper encryption: Break out a "httpclient" wrapper for seastar httpclient	2025-04-03 10:35:14 +03:00
Kefu Chai	0cd6cf1dc5	main: Remove unused member variable `_sys_ks` Fixes a Clang error by removing the unused private field `sstable_dict_deleter::_sys_ks` that was flagged with: [-Werror,-Wunused-private-field] ``` /home/kefu/.local/bin/clang++ -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -I/usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -flto=thin -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/= -ffile-prefix-map=/home/kefu/dev/scylladb/build=. 
-ffile-prefix-map=/home/kefu/dev/scylladb/build/=build -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -ffat-lto-objects -std=gnu++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -MD -MT CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -MF CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o.d -o CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -c /home/kefu/dev/scylladb/main.cc /home/kefu/dev/scylladb/main.cc:1660:38: error: private field '_sys_ks' is not used [-Werror,-Wunused-private-field] 1660 \| db::system_keyspace& _sys_ks; \| ^ ``` The member variable is not referenced anywhere in the code, so removing it improves maintainability without affecting functionality. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23545	2025-04-02 20:07:39 +03:00
Evgeniy Naydanov	84a5037056	test.py: cluster/suite.yaml: update test filters After switching to subfolders, the filter `run_in_debug` for the random failures test was just copied as is, but it actually needs to include the subfolder. Also, `test_old_ip_notification_repro` was deleted, so we don't need it in the `skip_in_debug` list. Closes scylladb/scylladb#23492	2025-04-02 19:29:27 +03:00
Kefu Chai	a09ec9d60d	.github: add delay before checking for required PR labels Improve the GitHub workflow to prevent premature email notifications about missing labels. Previously, contributors without write permissions to the scylladb repo would receive immediate notification emails about missing required backport labels, even if they were in the process of adding them. This change introduces a 1-minute grace period before checking for required labels, giving contributors sufficient time to add necessary labels (like backport labels) to their pull requests before any warning notifications are sent. The delay makes the experience more user-friendly for non-maintainer contributors while maintaining the labeling requirements. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23539	2025-04-02 19:28:15 +03:00
Aleksandra Martyniuk	bae6711809	test: add test to check concurrent migration and repair of two different tablets	2025-04-02 15:30:17 +02:00
Radosław Cybulski	c36614e16d	alternator: add size check to BatchWriteItem Add a size check for the BatchWriteItem command - if the item count is bigger than the configuration value `alternator_maximum_batch_write_size`, an error will be raised and no modification will happen. This is done to synchronize with DynamoDB, where the maximum size of BatchWriteItem is 25. To avoid complaints from clients who rely on our BatchWriteItem being limitless, we set the default value to 100. Fixes #5057 Closes scylladb/scylladb#23232	2025-04-02 14:48:00 +03:00
Avi Kivity	882f405eed	Merge "Convert gossiper's endpoint state map to be host id based" from Gleb " The series makes endpoint state map in the gossiper addressable by host id instead of ips. The transition has implications outside of the gossiper as well. Gossiper based topology operations are affected by this change since they assume that the mapping is ip based. The on-wire protocol is not affected by the change as maps that are sent by the gossiper protocol remain ip based. If an old node sends two different entries for the same host id, the one with the newer generation is applied. If a new node has two ids that are mapped to the same ip, the newer one is added to the outgoing map. Interoperability was verified manually by running a mixed cluster. The series concludes the conversion of the system to be host id based. " * 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev: gossiper: make examine_gossiper private gossiper: rename get_nodes_with_host_id to get_node_ip treewide: drop id parameter from gossiper::for_each_endpoint_state treewide: move gossiper to index nodes by host id gossiper: drop ip from replicate function parameters gossiper: drop ip from apply_new_states parameters gossiper: drop address from handle_major_state_change parameter list gossiper: pass rpc::client_info to gossiper_shutdown verb handler gossiper: add try_get_host_id function gossiper: add ip to endpoint_state serialization: fix std::map de-serializer to not invoke value's default constructor gossiper: drop template from wait_alive_helper function gossiper: move get_supported_features and its users to host id storage_service: make candidates_for_removal host id based gossiper: use peers table to detect address change storage_service: use std::views::keys instead of std::views::transform that returns a key gossiper: move _pending_mark_alive_endpoints to host id gossiper: do not allow to assassinate endpoint in raft topology mode gossiper: fix indentation after previous patch 
gossiper: do not allow to assassinate non existing endpoint	2025-04-02 12:30:00 +03:00
Pavel Emelyanov	832d83ae4b	sstables_loader: Do not stop sharded<progress_monitor> unconditionally The member in question is unconditionally .stop()-ed in task's release_resources() method, however, it may happen that the thing wasn't .start()-ed in the first place. Start happens in the middle of the task's .run() method and there can be several reasons why it can be skipped -- e.g. the task is aborted early, or collecting sstables from S3 throws. fixes: #23231 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23483	2025-04-02 12:09:02 +03:00
Kefu Chai	6da758d74c	config: mark uuid_sstable_identifiers_enabled unused the option `uuid_sstable_identifiers_enabled` was introduced in `f014ccf3`. the first version which has this change was 5.4, and 6.1 has been branched. during the discussion of backup and restore, we realized that we've been taking efforts to address problems which could have been addressed with sstables with UUID-based identifiers. see also #10459, which is the issue that proposed implementing the UUID-v1 based sstable identifier. now that two major releases have passed, we should have the luxury to mark this option "unused". this option was previously introduced to keep backward compatibility, and to allow users to opt out of the feature for some reason. so in this change, mark the option unused, so that if any user still sets this option on the command line, they will get a clear error. but we still parse and handle this setting in `scylla.yaml`, so that this option is still respected for existing settings, and for existing tests, which are not yet prepared for the uuid-based sstable identifiers. Refs #10459 Fixes #20337 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#20341	2025-04-01 20:21:47 +03:00
Botond Dénes	3bad46a6e2	docs/dev: add tombstone.md An exhaustive document on the tombstone related internal logic as well as the user-facing aspects. Closes scylladb/scylladb#23454	2025-04-01 20:17:57 +03:00
Botond Dénes	a0d8102a1f	replica/memtable: s/make_flat_reader/make_mutation_reader/ Following the recent refactoring of removing "flat" and "v2" from reader names, replacing all the fully qualified names with simply "mutation_reader". Closes scylladb/scylladb#23346	2025-04-01 17:58:13 +03:00
Artsiom Mishuta	032b28d793	test.py: remove pylib_test from test.py/CI run pylib_test contains one pure Python test. This test does not test Scylla. This test is not deleted because it can be useful to run during pre-commit, for example, but it definitely should not be run in CI in modes with 3 repeats each. It does not make sense. It is a Unit test for test.py framework. Note: test still can be easily run by pytest via the command: ./tools/toolchain/dbuild pytest test/pylib_test Closes scylladb/scylladb#23181	2025-04-01 16:43:45 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Avi Kivity	69684e16d8	Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes: - We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards. - We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options). - We add a cluster feature which guards the creation of dictionary-compressed SSTables. - We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards. - We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries. - We add a HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that). 
- We add a HTTP API call which estimates the impact of various available compression configurations on the compression ratio. - We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement. Known imperfections: - The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first. New feature, no backporting involved. Closes scylladb/scylladb#23025 * github.com:scylladb/scylladb: docs: add user-facing documentation for SSTable compression with shared dicts docs/dev: add sstable-compression-dicts.md test: add test_sstable_compression_dictionaries_autotrain.py test: add test_sstable_compression_dictionaries_basic.py test/pylib/rest_client: add `keyspace_upgrade_sstables` helper main: run a sstable_dict_autotrainer api: add the estimate_compression_ratios API call dict_autotrainer: introduce sstable_dict_autotrainer db/system_keyspace: add query_dict_timestamp compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor main: clean up sstable compression dicts after table drops sstables/compress: discard hidden compression options after the decompressor is created compress: change compressor_ptr from shared_ptr to unique_ptr api: add the retrain_dict API call storage_service: add some dict-related routines main: in compression_dict_updated_callback, recognize and use SSTable compression dicts storage_service: add do_sample_sstables() messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback database: add sample_data_files() database: add take_sstable_set_snapshot() compress: teach `lz4_processor` about 
dictionaries compress: teach `zstd_processor` about dictionaries sstables: delegate compressor creation to the compressor factory sstables: plug an `sstable_compressor_factory` into `sstables_manager` sstables: introduce sstable_compressor_factory utils/hashers: add get_sha256() gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature compress: add hidden dictionary options compress: remove `compression_parameters::get_compressor()` sstables/compress: remove get_sstable_compressor() sstables/compress: move ownership of `compressor` to `sstable::compression` compress: remove compressor::option_names() compress: clean up the constructor of zstd_processor compress: squash zstd.cc into compress.cc sstables/compress: break the dependency of `compression_parameters` on `compressor` compress.hh: switch compressor::name() from an instance member to a virtual call bytes: adapt fmt_hex to std::span<const std::byte>	2025-04-01 12:47:34 +03:00
Aleksandra Martyniuk	1dc29ddc86	repair: release erm in repair_writer_impl::create_writer when possible Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed. Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked. Fixes: #23453.	2025-04-01 11:34:21 +02:00
Calle Wilund	c6674619b7	encryption::gcp: Use seastar http client wrapper Refs #22925 Remove direct usage of seastar http client, and instead share this with other connectors via the http client wrapper type.	2025-04-01 08:18:05 +00:00
Calle Wilund	491748cde3	encryption::kms: Drop local http client and use seastar wrapper Fixes #22925 Removes the boost based http client in favour of our seastar wrapper.	2025-04-01 08:18:05 +00:00
Calle Wilund	878f76df1f	encryption: Break out a "httpclient" wrapper for seastar httpclient Refs #22925 Adds some wrapping and helpers for the kind of REST operations we expect to perform. Some things like stream formatting are redundant vis-à-vis seastar, but on that level we only have \r\n encoded writing to output_stream and similar, which is less useful for things like logging.	2025-04-01 08:18:05 +00:00
Piotr Smaron	370707b111	service: restore default timeout in `announce_with_raft` This restored timeout seems to have been accidentally removed in `7081215552 (r2005352424)`. Without it, `raft_server_with_timeouts::run_with_timeout` will get `std::nullopt` as a value of the `timeout` parameter and perform an operation without any timeout, whereas previously it would have waited for the default timeout specified in `raft_server_for_group::default_op_timeout`. Closes scylladb/scylladb#23380	2025-04-01 10:20:16 +03:00
David Garcia	6e61fc323b	docs: redirect to docs.scylladb.com/manual/ Define a custom alert to redirect users to the latest version of the docs in https://docs.scylladb.com/manual/ Closes scylladb/scylladb#22636	2025-04-01 09:22:56 +03:00
Botond Dénes	bd9f51a29c	Merge 'transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Vladislav Zolotarov A default timestamp (not to be confused with the timestamp passed via the 'USING TIMESTAMP' query clause) can be set using the 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be the default of the Java CQL Driver. However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to set it back then). This patch fixes this. Fixes #23173 The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases. Closes scylladb/scylladb#23174 * github.com:scylladb/scylladb: CQL Tracing: set common query parameters in a single function transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing	2025-04-01 09:16:02 +03:00
Pavel Emelyanov	b5a124f60c	sstable_directory: Move highest_generation_seen() to distributed_loader.cc This method is only used by the loader code (and tests). Also, there's the highest_version_seen() peer that sits in the loader code as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23324	2025-04-01 09:15:14 +03:00
Pavel Emelyanov	eafc767cc6	sstable/filesystem: Add convenience helper to generate filename In its operations the fs storage carefully generates full filename from all sstable parameters -- version, format, generation, keyspace and table names and component type or name. However, in all of the cases format, version and keyspace:table names are inherited from the sstable being operated on. This calls for a filename generation helper that wraps most of the arguments thus making the lines shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23384	2025-04-01 09:14:44 +03:00
Botond Dénes	0fdf2a2090	Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy So that a multi-dc/multi-rack cluster can be populated in a single call. * Enhancement, no backport required Closes scylladb/scylladb#23341 * github.com:scylladb/scylladb: test/pylib: servers_add: add auto_rack_dc parameter test/pylib: servers_add: support list of property_files	2025-04-01 09:14:20 +03:00
Botond Dénes	94e8971308	scylla-gdb.py: improve scylla repairs command Make output more readable by: * group follower/master repair instances separately * split repair details into one line for repair summary, then one line for each host info * add indentation to make the output easier to follow Also add -m\|--memory option to calculate memory usage of repair buffers. Example output: (gdb) scylla repairs -m Repairs for which this node is leader: (repair_meta) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished (repair_meta) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished (repair_meta) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished (repair_meta) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished Repairs for which this node is follower:	2025-04-01 01:53:35 -04:00
Botond Dénes	47c62a4cf2	scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__ There is currently no easy way to null-check seastar_lw_shared_ptr. Comparing get() against 0 doesn't work, if _p is null, get() will return an illegal pointer. So add methods to allow for easy null-checks by comparing _p with 0 instead.	2025-04-01 01:53:34 -04:00
Botond Dénes	f84bf43c96	scylla-gdb.py: introduce managed_bytes Extracted from managed_bytes_printer. Make working with managed_bytes easier. Abstracts how size and content is obtained.	2025-04-01 01:53:34 -04:00
Jenkins Promoter	6c528f5027	Update pgo profiles - aarch64	2025-04-01 04:45:44 +03:00
Jenkins Promoter	3c12029584	Update pgo profiles - x86_64	2025-04-01 04:27:11 +03:00
Michał Chojnowski	36be9d1c9b	docs: add user-facing documentation for SSTable compression with shared dicts	2025-04-01 00:07:31 +02:00
Michał Chojnowski	d33ffb221b	docs/dev: add sstable-compression-dicts.md	2025-04-01 00:07:31 +02:00
Michał Chojnowski	f851efd4fa	test: add test_sstable_compression_dictionaries_autotrain.py Adds a test which checks that sstable compression dict autotraining does its job.	2025-04-01 00:07:31 +02:00
Michał Chojnowski	62da3d8363	test: add test_sstable_compression_dictionaries_basic.py Add a basic integration test for SSTable compression with shared dictionaries.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	7b0eeefd79	test/pylib/rest_client: add `keyspace_upgrade_sstables` helper	2025-04-01 00:07:30 +02:00
Michał Chojnowski	3f7969313f	main: run a sstable_dict_autotrainer Create an instance of `sstable_dict_autotrainer` in `scylla_main` and run it.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	a19d6d95f7	api: add the estimate_compression_ratios API call Add an API call which estimates the effectiveness of possible compression config changes. This can be used to make an informed decision about whether to change the compression method, without actually recompressing any SSTables.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	4f0d453acf	dict_autotrainer: introduce sstable_dict_autotrainer Add a fiber responsible for periodic re-training of compression dictionaries (for tables which opted into dict-aware compression). As of this patch, it works like this: every `$tick_period` (15 minutes), if we are the current Raft leader, we check for dict-aware tables which have no dict, or a dict older than `$retrain_period`. For those tables, if they have enough data (>1GiB) for a training, we train a new dict and check if it's significantly better than the current one (provides ratio smaller than 95% of current ratio), and if so, we update the dict.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	9d02e2c005	db/system_keyspace: add query_dict_timestamp Adds a helper method which queries the creation timestamp of a given dict in `system.dicts`. We will later use the age of the current SSTable compression dict to decide if another training should be done already.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	cb1b291051	compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor Add new compressor names to `sstable_compression`. When those names are configured in the schema, new SSTables will be compressed with dict-aware Zstd or LZ4 respectively.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	bea866a46f	main: clean up sstable compression dicts after table drops When a table is dropped, its corresponding dictionary in `system.dicts` -- if any -- should be deleted, otherwise it will remain forever as garbage. This commit implements such cleanup.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	cee504f66f	sstables/compress: discard hidden compression options after the decompressor is created Dictionary contents are kept in the list of "compression options" in the header of `CompressionInfo.db`, and they are loaded from disk into memory when the `sstable::compression` object is populated. After the decompressor for the SSTable is created based on those dict contents, they are not needed in RAM anymore. And since they take up a sizeable amount of memory, we would like to free them. In this patch, we discard all "hidden compression options" (currently: only the dictionary contents) from the `sstable::compression` object right after the decompressor is created. (Those options are not supposed to be used for anything else anyway).	2025-04-01 00:07:30 +02:00
Michał Chojnowski	10fa4abde7	compress: change compressor_ptr from shared_ptr to unique_ptr Cleanup patch. After we moved the ownership of compressors to sstables, compressor objects never have shared lifetime. `unique_ptr` is more appropriate for them than `shared_ptr` now. (And besides expressing the intent better, using `unique_ptr` prevents an accidental cross-shard `shared_ptr` copy).	2025-04-01 00:07:29 +02:00
Michał Chojnowski	58ae278d10	api: add the retrain_dict API call Add an API call which will retrain the SSTable compression dictionary for a given table. Currently, it needs all nodes to be alive to succeed. We can relax this later.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	4115a6fece	storage_service: add some dict-related routines storage_service will be the interface between the API layer (or the automatic training loop) and the dict machinery. This commit implements the relevant interface for that. It adds methods that: 1. Take SSTable samples from the cluster, using the new RPC verbs. 2. Train a dict on the sample. (The trainer will be plugged in from `main`). 3. Publish the trained dictionary. (By adding mutations to Raft group 0). Perhaps this should be moved to a separate "service". But it's not like `storage_service` has a clear purpose anyway.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	94d244ab49	main: in compression_dict_updated_callback, recognize and use SSTable compression dicts Currently, there is at most one dictionary in `system.dicts`: named "general", used by RPC compression. So the callback called on `system.dicts` just always refreshes the RPC compression dict. In a follow-up commit, we will publish SSTable compression dicts to `system.dicts` rows with a name in the "sstables/{table_uuid}" format. We want modification to such rows to be passed as new dictionary recommendations to the SSTable compressor factory. This commit teaches the `system.dicts` modification callback to recognize such modifications and forward them to the compressor factory.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	380f409c46	storage_service: add do_sample_sstables() Adds a helper which uses ESTIMATE_SSTABLE_VOLUME and SAMPLE_SSTABLES RPC calls to gather a combined sample of SSTable Data files for the given table from the entire cluster.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	94c33b6760	messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs Add two verbs needed to implement dictionary training for SSTable compression. SAMPLE_SSTABLES returns a list of randomly-selected chunks of Data files with a given cardinality and using a given chunk size, for the given table. ESTIMATE_SSTABLE_VOLUME returns the total uncompressed size of all Data files of the given table.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	4856f4acca	db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict Extend the `system.dicts` helper for querying and modifying `system.dicts` with an ability to use names other than "general". We will use that in later commits to publish dictionaries for SSTable compression.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	d920ab5366	database: add sample_data_files() Add a helper for sampling the Data files for a given table. We will use it to take samples for dictionary training.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	48c06c7e4b	database: add take_sstable_set_snapshot() We want a method that will allow us to take a stable snapshot of SSTables, to asynchronously compute some stats on them. But `take_storage_snapshot` is overly invasive for that, because it flushes memtables on each call. (If `take_storage_snapshot` was, for example, called repetitively, it could create a ton of small memtables and lead to trouble). This commit adds a weaker version which only takes a snapshot of existing SSTables, and doesn't flush memtables by itself. This will be useful for dictionary training, which doesn't care about the semantics of SSTables, only their rough statistical properties.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	64f3d7e364	compress: teach `lz4_processor` about dictionaries Extend `lz4_processor` with the ability to use dictionaries. We won't use this ability yet. It will be used when new compressor names are added.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	b65101b371	compress: teach `zstd_processor` about dictionaries Extend `zstd_processor` with the ability to use dictionaries. We won't use this ability yet. It will be used when new compressor names are added.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	b18ddcb92e	sstables: delegate compressor creation to the compressor factory Remove `compressor::create()`. This enforces that compressors are only created through the `sstable_compressor_factory`. Unlike the synchronous `compressor::create()`, the factory will be able to create dict-aware compressors.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	ebf02913a2	sstables: introduce sstable_compressor_factory Before this commit, `compressor` objects are synchronously created, during the creation or opening of SSTables, from `compression_parameters` objects. But we want to add compression dictionaries to SSTables and we want to share dictionary contents across shards. To do that, we need to make the creation of `compressor` objects asynchronous, and give it access to a global dictionary registry. We encapsulate that in a `sstable_compressor_factory`. Instead of calling `compressor::create()` on SSTable opening or creation, we will ask the factory, asynchronously, for a new compressor, and it will return a compressor with a deduplicated, up-to-date dictionary. This commit introduces such a factory. It's not used anywhere yet, and the compressors it produces don't use the provided dictionaries yet.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	2bd393849c	utils/hashers: add get_sha256() Add a helper function which computes the SHA256 for a blob. We will use it to compute identifiers for SSTable compression dictionaries later.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	61316e29df	gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature This feature will guard against writing SSTables containing compression dictionaries before the entire cluster is able to understand them.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	dd932ebb2f	compress: add hidden dictionary options Before this commit, "compression options" written into CompressionInfo.db (and used to construct a decompressor) have a 1:1 correspondence to "compression options" specified in the schema. But we want to add a new "compression option" -- the compression dictionary -- which will be written into CompressionInfo.db and used to construct decompressors, but won't be specified in the schema. To reconcile that, in this commit we introduce the notion of a "hidden option". If an option name in `CompressionInfo.db` begins with a dot, then this option will be used to construct decompressors, but won't be visible for other uses. (I.e. for the `sstable_info` API call and for recovering a fake `schema` from `CompressionInfo.db` in the `scylla sstable` tool). Then, we introduce the hidden `.dictionary.{0,1,2,..}` options, which hold the contents of the dictionary blob for this SSTable. (The dictionary is split into several parts because the SSTable format limits the length of a single option value to 16 bits, and dictionaries usually have a length greater than that). This commit only introduces helpers which translate dictionary blobs into "options" for CompressionInfo.db, and vice-versa, but it doesn't use those helpers yet. They will be used in later commits.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	11be7c0704	compress: remove `compression_parameters::get_compressor()` Following up on the previous commits, we avoid constructing compressors where not necessary, by checking things directly on `compression_parameters` instead.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	006c631642	sstables/compress: remove get_sstable_compressor() Following up on the previous commit, we avoid constructing a compressor in the `sstable_info` API call, and we instead read the compression options from the `sstable::compression`.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	8e611536b0	sstables/compress: move ownership of `compressor` to `sstable::compression` SSTable readers and writers use `compressor` objects to compress and decompress chunks of SSTable data files. `compressor` objects are read-only, so only one of them is needed for each SSTable. Before this commit, each reader and writer has its own `compressor` object. This isn't necessary, but it's okay. But later in this series it will stop being okay, because the creation of a `compressor` will become an expensive cross-shard operation (because it might require sharing a compression dictionary from another shard). So we have to adjust the code so that there is only one `compressor` per sstable, not one per reader/writer. We stuff the ownership of this compressor into `sstable::compression`. To make the ownership clear, we remove `compression_ptr` shared pointers from readers and writers, and make them access the compressor via the `sstable::compression` instead.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	7bdcd5e8c1	compress: remove compressor::option_names() It used to be used by `compression_parameters` validation logic to ask the created `compressor` for compressor-specific option names. Since we no longer delegate this to `compressor`, but we just put the knowledge of those options directly into `compression_parameters`, it's dead code now.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	3b0ab8e1ee	compress: clean up the constructor of zstd_processor Since we now parse and validate the compression level during the construction of `compression_parameters`, we can just pass the structured params to `zstd_processor` instead of passing a raw string map.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	6470035a74	compress: squash zstd.cc into compress.cc Unlike all other implementations of `compressor`, `zstd_processor` has its own special object file and its own special late binding mechanism (via the `class_registry`). It doesn't need either. Let's squash it into `compress.cc`. Keeping `zstd_processor` a separate "module" would require adding even more headers and source files later in the series (when adding dictionaries), and there's no benefit in being so granular. All `compressor` logic can be in `compress.cc` and it will still be small enough. This commit also gets rid of the pointless `class_registry` late binding mechanism and just constructs the `zstd_processor` in `compressor::create()` with a regular constructor call.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	cfe69e057f	sstables/compress: break the dependency of `compression_parameters` on `compressor` Note: this commit is meant to be a code refactoring only and is not intended to change the observable behaviour. Today `schema` contains a `compression_parameters`. `compression_parameters` contains an instance of `compressor`, and SSTable writers just share that instance. This is fine because `compressor` is a stateless object, functionally dependent on the schema. But in later parts of the series, we will break this functional dependency by adding dictionaries to compressors. Two writers for the same schema might have different dictionaries, so they won't be able to just share a single instance contained in the schema. And when that happens, having a `compressor` instance in the `schema`/`compression_parameters` will become awkward, since it won't be actually used. It will be only a container for options. In addition, for performance reasons, we will want to share some pieces of compressors across shards, which will require -- in the general case -- a construction of a compressor to be asynchronous, and therefore not possible inside the constructor of `compression_parameters`. This commit modifies `compression_parameters` so that it doesn't hold or construct instances of `compressor`. Before this patch, the `compressor` instance constructed in `compression_parameters` has an additional role of validating and holding compressor-specific options. (Today the only such option is the zstd compression level). This means that the pieces of logic responsible for compressor-specific options have to be rewritten. That ends up being the bulk of this commit.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	f4ca94d13b	compress.hh: switch compressor::name() from an instance member to a virtual call Before this patch, `compressor` is designed to be a proper abstract class, where the creator of a compressor doesn't even know what he's creating -- he passes a name, and it gets turned into a `compressor` behind the scenes. But later, when creation of compressors will involve looking up dictionaries, this abstraction will only get in the way. So we give up on keeping `compressor` abstract, and instead of using "opaque" names we turn to an explicit enum of possible compressor types. The main point of this patch is to add the `algorithm` enum and the `algorithm_to_name()` function. The rest of the patch switches the `compressor::name()` function to use `algorithm_to_name()` instead of the passed-by-constructor `compressor::_name`, to keep a single source of truth for the names.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	4f634de2e9	bytes: adapt fmt_hex to std::span<const std::byte> This allows us to hexdump things other than `bytes_view`. (That is, without reinterpret_casting them to `bytes_view`, which -- aside from the inconvenience -- isn't quite legal. In contrast, any span can be legally casted to `std::span<const std::byte>`).	2025-04-01 00:07:27 +02:00
Robert Bindar	b647196121	Remove db::config::object_storage_config That map became redundant once we added object_storage_endpoints in the config, this patch removes it and switches all the user code to use the new option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 17:15:12 +03:00
Gleb Natapov	3abe5de8bf	gossiper: make examine_gossiper private	2025-03-31 16:50:50 +03:00
Gleb Natapov	afdfde8300	gossiper: rename get_nodes_with_host_id to get_node_ip Also change it to return std::optional instead of std::set since now there can be only one ip mapped to an id.	2025-03-31 16:50:50 +03:00
Gleb Natapov	28fb84117d	treewide: drop id parameter from gossiper::for_each_endpoint_state We have it in endpoint_state anyway, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	4609bbbbb2	treewide: move gossiper to index nodes by host id This patch changes gossiper to index nodes by host ids instead of ips. The main data structure that changes is _endpoint_state_map, but this results in a lot of changes since everything that uses the map directly or indirectly has to be changed. The big victim of this outside of the gossiper itself is topology over gossiper code. It works on IPs and assumes the gossiper does the same and both need to be changed together. Changes to other subsystems are much smaller since they already mostly work on host ids anyway.	2025-03-31 16:50:50 +03:00
Gleb Natapov	19ac05b0ba	gossiper: drop ip from replicate function parameters We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	c5b8429bec	gossiper: drop ip from apply_new_states parameters We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	6da5f541a2	gossiper: drop address from handle_major_state_change parameter list We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	5e06bf76e0	gossiper: pass rpc::client_info to gossiper_shutdown verb handler It will be needed later to obtain host id of the peer.	2025-03-31 16:50:50 +03:00
Gleb Natapov	704580b197	gossiper: add try_get_host_id function The function returns unengaged std::optional if id is not found instead of throwing like get_host_id does.	2025-03-31 16:50:45 +03:00
Tomasz Grabiec	29d1c2adc6	Merge 'Finalize tablet splits earlier' from Lakshmi Narayanan Sreethar Resize finalization is executed in a separate topology transition state, `tablet_resize_finalization`, to ensure it does not overlap with tablet transitions. The topology transitions into the `tablet_resize_finalization` state only when no tablet migrations are scheduled or being executed. If there is a large load-balancing backlog, split finalization might be delayed indefinitely, leaving the tables with large tablets. This PR fixes the issue by updating the load balancer to not schedule any migrations and to not make any repair plans when there is a resize finalization pending in any table. Also added a testcase to verify the fix. Fixes #21762 Improvement : No need to backport. Closes scylladb/scylladb#22148 * github.com:scylladb/scylladb: topology_coordinator: fix indentation in generate_migration_updates topology_coordinator: do not schedule migrations when there are pending resize finalizations load_balancer: make repair plans only when there is no pending resize finalization	2025-03-31 14:42:34 +02:00
Gleb Natapov	6999b474a1	gossiper: add ip to endpoint_state Store endpoint's IP in the endpoint state. Currently it is stored as a key in gossiper's endpoint map, but we are going to change that. The new field is not serialized when endpoint state is sent over rpc, so it is set by the rpc handler from the value in the map that is in the rpc message. This map will not be changed to be host id based to not break interoperability.	2025-03-31 15:42:08 +03:00
Gleb Natapov	9bb2edcae6	serialization: fix std::map de-serializer to not invoke value's default constructor	2025-03-31 15:42:07 +03:00
Gleb Natapov	e5cc3b75f8	gossiper: drop template from wait_alive_helper function Move ip to id translation to the caller.	2025-03-31 15:42:07 +03:00
Gleb Natapov	0dd86b4f1d	gossiper: move get_supported_features and its users to host id	2025-03-31 15:42:07 +03:00
Gleb Natapov	f97bb6922d	storage_service: make candidates_for_removal host id based	2025-03-31 15:42:07 +03:00
Gleb Natapov	82491cec19	gossiper: use peers table to detect address change This requires serializing entire handle_state_normal with a lock since it both reads and updates peers table now (it only updated it before the change). This is not a big deal since most of it is already serialized with token metadata lock. We cannot use it to serialize peers writes as well since the code that removes an endpoint from peers table also removes it from gossiper which causes on_remove notification to be called and it may take the metadata lock as well causing deadlock.	2025-03-31 15:41:44 +03:00
Tomasz Grabiec	6bff596fce	tablets: Make tablet allocation equalize per-shard load Before, it was equalizing per-node load (tablet count), which is wrong in heterogenous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378	2025-03-31 14:34:30 +02:00
Gleb Natapov	1c2a9257e9	storage_service: use std::views::keys instead of std::views::transform that returns a key	2025-03-31 15:25:39 +03:00
Gleb Natapov	a581a99dbf	gossiper: move _pending_mark_alive_endpoints to host id Index _pending_mark_alive_endpoints map by host id instead of ip	2025-03-31 15:25:39 +03:00
Gleb Natapov	555149c153	gossiper: do not allow to assassinate endpoint in raft topology mode It does nothing but harm in raft topology mode.	2025-03-31 15:25:39 +03:00
Gleb Natapov	4cc1c10035	gossiper: fix indentation after previous patch	2025-03-31 15:25:39 +03:00
Gleb Natapov	e8b7aaa0d4	gossiper: do not allow to assassinate non existing endpoint We assume that all endpoint states have HOST_ID set or the host id is available locally, but the assassinate code injects a state without HOST_ID for not existing endpoint violating this assumption.	2025-03-31 15:25:39 +03:00
Botond Dénes	90c20858ed	Merge 'test/database: Remove most of take_snapshot() helper overloads and re-use them more' from Pavel Emelyanov This helper facilitates snapshot creation by various test cases in database_test.cc. This PR generalizes all overloads into one that suits all callers and patches one more test case to use it as well. Closes scylladb/scylladb#23482 * github.com:scylladb/scylladb: test/database: Re-use take_snapshot() helper once more test/database: Remove most of take_snapshot() helper overloads	2025-03-31 15:20:51 +03:00
Benny Halevy	5f2ce0b022	loading_cache_test: test_loading_cache_reload_during_eviction: use manual_clock Rather than lowres_clock, as since `32b7cab917`, loading_cache_for_test uses manual_clock for timing and relying on lowres_clock to time the test might run out of memory on fast test machines. Fixes #23497 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23498	2025-03-31 14:53:06 +03:00
Robert Bindar	e3a3508960	Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 13:39:39 +03:00
Pavel Emelyanov	ac582efb44	test/database: Re-use take_snapshot() helper once more There's a test case that can call the recently patched take_snapshot() helper as well. This changes nothing, but makes further patching a bit simpler (not in this branch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-31 13:18:06 +03:00
Pavel Emelyanov	7e6380b6bd	test/database: Remove most of take_snapshot() helper overloads There are 3 of those that help tests (re)shuffle cql_test_env/database, skip_flush == true/false options and keyspace/table/snapshot names. There's little sense in having that many of those, just one overload with default arguments suits most of the callers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-31 13:18:06 +03:00
Botond Dénes	ea55eed037	Merge 'Snapshot several tables at once in scrub API handler' from Pavel Emelyanov The scrub API handler may want to snapshot several tables. For that, it calls snapshot-ctl method to snapshot a single table for each table in the list. That's excessive, snapshot-ctl has a method to snapshot a bunch of tables at once, just what the scrub handler needs. It's an improvement, so no need to backport Closes scylladb/scylladb#23472 * github.com:scylladb/scylladb: snapshot-ctl: Remove unused snapshot-single-table method api: Snapshot all tables at once in scrub handler	2025-03-31 13:00:32 +03:00
Piotr Smaron	aff8cbc6f3	CODEOWNERS: remove expired owners Removing krzaq, who's no longer with the company. Removing core-frontend team members from Alternator areas, as it's no longer the domain of this team. Closes scylladb/scylladb#23500	2025-03-31 11:37:51 +03:00
Pavel Emelyanov	0077acd1bb	api: Properly validate table in tablet add|del replica handlers The handlers in question simply call database.find_column_family; if the table doesn't exist, the no_such_column_family exception is thrown, which is not nice. The proper behavior is to throw bad_param, and there's a helper that does it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23389	2025-03-31 10:03:17 +02:00
Andrzej Jackowski	c89d8c6566	cql3: prevent from empty option use in cf_statement::column_family() The implementation of cf_statement::column_family() dereferences the _cf_name option without checking whether the option is non-empty. On the enterprise branch, there is a safeguard that prevents dereferencing such an empty option. Although the current code on master does not seem to call column_family() when _cf_name is empty, it is safer to introduce the same workaround on master, to avoid any regression. This change: - Prevents empty option use in cf_statement::column_family() Fixes: scylla-enterprise#5273 Closes scylladb/scylladb#23366	2025-03-31 09:43:22 +03:00
Michał Chojnowski	e23fdc0799	table: fix a race in table::take_storage_snapshot() `safe_foreach_sstable` doesn't do its job correctly. It iterates over an sstable set under the sstable deletion lock in an attempt to ensure that SSTables aren't deleted during the iteration. The thing is, it takes the deletion lock after the SSTable set is already obtained, so SSTables might get unlinked before we take the lock. Remove this function and fix its usages to obtain the set and iterate over it under the lock. Closes scylladb/scylladb#23397	2025-03-31 09:40:32 +03:00
Avi Kivity	2b9e1e61d0	docs: reader_concurrency_semaphore: document CPU concurrency limit Document the CPU concurrency implemented in `3d816b7c16` and adjusted in `3d12451d1f`. Closes scylladb/scylladb#23404	2025-03-31 09:39:55 +03:00
Dawid Mędrek	b0b0c5905e	test/cluster/test_multidc: Clean up RF-rack-valid keyspaces tests There are some minor things we should fix that are a remnant of the original changes (scylladb/scylladb@7646e14). Closes scylladb/scylladb#23429	2025-03-31 09:38:42 +03:00
David Garcia	1a7be07b8c	docs: renders os-support from json file docs: renders os-support from json file Closes scylladb/scylladb#23436	2025-03-31 09:36:49 +03:00
Marcin Maliszkiewicz	e3f2ebd4fb	cql3: remove not needed cmd copy in indexed_table_select_statement The variable is unused. Removing it should give a tiny perf increase, as it saves an allocation. Closes scylladb/scylladb#23473	2025-03-31 09:34:32 +03:00
Avi Kivity	73e4a3c581	sstables: store features early in write path sstable features indicate that an sstable has some extension, or that some bug was fixed. They allow us to know if we can rely on certain properties of an sstable being read. Currently, sstable features are set early in the read path (when we read the scylla metadata file) and very late in the write path (when we write the scylla metadata file just before sealing the sstable). However, we happen to read features before we set them in the write path: when we resize the bloom filter for a newly written sstable we instantiate an index reader, and that depends on some features. As a result, we read a disengaged optional (for the scylla metadata component) as if it were engaged. This somehow worked so far, but fails with the libstdc++ hash table implementation. Fix it by moving storage of the features to the sstable itself, and setting it early in the write path. Fixes #23484 Closes scylladb/scylladb#23485	2025-03-31 09:33:56 +03:00
Pavel Emelyanov	693387bda6	Merge 'test.py: topology: allow to run tests with bare pytest command' from Evgeniy Naydanov Add the possibility to run topology tests using a bare pytest command. To achieve this goal the following changes were made: - Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`. - To build a `TestSuite` object we need to discover the corresponding `suite.xml` file. Do this by walking up through the fs tree starting from the current test file. - Run ScyllaClusterManager using a pytest fixture if the `--manager-api` option is not provided. Some refactoring was also done: - Add path constants to the `test` module and use them in different test suites instead of their own duplicates of the same code: - TOP_SRC_DIR : ScyllaDB's source code root directory - TEST_DIR : the directory with test.py tests and libs - BUILD_DIR : directory with ScyllaDB's build artifacts - Add the TestSuite.log_dir attribute as a ScyllaDB build mode subdir of a path provided using the `--tmpdir` CLI argument. Don't use the `tmpdir` name because it gets mixed up with pytest's built-in fixture and the `--tmpdir` option itself. - Change the default value for `--tmpdir` from `./testlog` to `TOP_SRC_DIR/testlog` - Refactor `ResourceGather` classes to use the path from a `test` object instead of providing it separately. - Move modes constants (`all_modes`/`ALL_MODES` and `debug_modes`/`DEBUG_MODES`) to the `test` module and remove duplication. - Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to `pylib.suite.base` to avoid circular imports. - In some places refactor to use f-strings for formatting. Also minor changes related to running with pytest-xdist: - When running tests in parallel we need to ensure that filenames are unique by adding the xdist worker ID to them. - Pass the random seed across xdist workers using an env variable. 
Closes scylladb/scylladb#22960 github.com:scylladb/scylladb: test.py: async_cql: remove unused event_loop fixture test.py: random_failures: make it play well with xdist test.py: add xdist worker ID to log filenames test.py: topology: run tests using bare pytest command test.py: add fixtures for current test suite and test test.py: refactor paths constants and options	2025-03-31 09:30:06 +03:00
Benny Halevy	a4aa4d74c1	test/pylib: servers_add: add auto_rack_dc parameter To quickly populate nodes in a single dc, each node in its own rack. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-30 19:23:40 +03:00
Benny Halevy	c4dbb11c87	test/pylib: servers_add: support list of property_files So that a multi-dc/multi-rack cluster can be populated in a single call. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-30 19:12:39 +03:00
Piotr Smaron	a2bbbc6904	auth: forbid modifying system ks by non-superusers Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces. Fixes: scylladb/scylladb#23218 Closes scylladb/scylladb#23219	2025-03-30 16:55:04 +03:00
Ferenc Szili	2c9b312b58	test: port of test and reproducer for resurrection during file based streaming This change ports test/cluster/test_resurrection.py from enterprise to master. Because the underlying issue deals with file based streaming, this test was a part of the enterprise repo. It contains the test and reproducer for the issue described below: When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables: 1 sstable containing an expired tombstone 2 sstable with additional data 3 sstable containing data which is shadowed by the expired tombstone in sstable 1 If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica. The fix for the issue was merged in `b66479ea98` This patch only ports the missing test. Closes scylladb/scylladb#23466	2025-03-30 13:39:40 +03:00
Andrzej Jackowski	b8adbcbc84	audit: fix empty query string in BATCH query Function modification_statement::add_raw() is never called, which makes query string in audit_info of batch queries empty. In enterprise branch, add_raw is called in Cql.g and those changes were never merged to master. This changes: - Add missing call of add_raw() to Cql.g - Include other related changes (from PR#3228 in scylla-enterprise) Fixes scylladb#23311 Closes scylladb/scylladb#23315	2025-03-30 13:37:11 +03:00
Michał Chojnowski	79a477ecb6	cmake: add the `-dynamic-linker=...` form to the -dynamic-linker regex On my system (Nix), the compiler produces a `-dynamic-linker=/nix/store/...` in the linker call scanned by get_padded_dynamic_linker_option. But the regex can't deal with the `=` there, it requires a ` `. Fix that. We also do the same in configure.py, and remove the Nix-specific hack which used to disable the entire mechanism. Closes scylladb/scylladb#22308	2025-03-30 11:58:47 +03:00
Kefu Chai	7814f6d374	github: improve seastar bad include check for better developer experience: - add inline annotations using problem matchers, see https://github.com/actions/toolkit/blob/main/docs/problem-matchers.md - use a single step for uploading both output files, because the `path` setting is actually passed to [@actions/glob](https://github.com/actions/toolkit/tree/main/packages/glob), i removed the double quotes and the leading "./" from the paths. - use "::error" workflow command to signify the failure, see https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#example-creating-an-annotation-for-an-error Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23310	2025-03-30 11:56:18 +03:00
Evgeniy Naydanov	1a0c14aa50	test.py: async_cql: remove unused event_loop fixture A newer version of pytest-asyncio (0.24.0) allows controlling the scope of the async loop per fixture. This workaround is no longer needed.	2025-03-30 03:19:30 +00:00
Evgeniy Naydanov	cac0257914	test.py: random_failures: make it play well with xdist Pass random seed across xdist workers using env variable.	2025-03-30 03:19:30 +00:00
Evgeniy Naydanov	9bba59631f	test.py: add xdist worker ID to log filenames When running tests in parallel we need to ensure that filenames are unique by adding the xdist worker ID to them.	2025-03-30 03:19:30 +00:00
Evgeniy Naydanov	9cb0ec2b42	test.py: topology: run tests using bare pytest command Run ScyllaClusterManager using a pytest fixture if the `--manager-api` option is not provided. At this stage we're trying to stay as close to test.py as possible. test.py runs tests file-by-file, so the scopes `session`, `package`, and `module` are effectively the same. Also, test.py starts ScyllaClusterManager for every test module, which is why the fixture `manager_api_sock_path` has scope=`module`. As a result, we need to change the scope of the fixture `manager_internal` too.	2025-03-30 03:19:29 +00:00
Evgeniy Naydanov	42075170d1	test.py: add fixtures for current test suite and test Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`. To build a TestSuite object we need to discover the corresponding `suite.xml` file. Do this by walking up through the fs tree starting from the current test file.	2025-03-30 03:19:29 +00:00
Evgeniy Naydanov	c4ae4e247a	test.py: refactor paths constants and options Add path constants to the `test` module and use them in different test suites instead of their own duplicates of the same code: - TOP_SRC_DIR : ScyllaDB's source code root directory - TEST_DIR : the directory with test.py tests and libs - BUILD_DIR : directory with ScyllaDB's build artefacts Add the TestSuite.log_dir attribute as a ScyllaDB build mode subdir of a path provided using the `--tmpdir` CLI argument. Don't use the `tmpdir` name because it gets mixed up with pytest's built-in fixture and the `--tmpdir` option itself. Change the default value for `--tmpdir` from `./testlog` to `TOP_SRC_DIR/testlog`. Refactor `ResourceGather*` classes to use the path from a `test` object instead of providing it separately. Move modes constants to the `test` module and remove duplications. Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to `pylib.suite.base` to avoid circular imports (with a little refactoring to use `pathlib.Path` instead of `str` as paths). Also, in some places refactor to use f-strings for formatting.	2025-03-30 03:19:29 +00:00
Michał Jadwiszczak	0ee0696959	test/cqlpy/test_service_level_api: update to service levels on raft and remove flakiness Tests in `test_service_level_api` were written before scylladb/scylladb#16585 and they were doing 10s sleeps to wait for service level controller to update its configuration. Now performing a read barrier is sufficient to ensure SL configuration is up-to-date, which significantly reduces tests time (from ~60s to ~2-3s). Moreover, there was flakiness in the `test_switch_tenants` test. Until now, the test waited up to 60s for the connections to update their scheduling groups. However, it is difficult to determine how long the process might take because a connection may be blocked while waiting for the next request to be processed, and the scheduling group will be updated only after a request is processed (see `generic_server::connection::process_until_tenant_switch()`). To address this issue, 100 simple queries are executed so that connections on all shards process at least one request and update their scheduling groups. Fixes scylladb/scylladb#22768 Closes scylladb/scylladb#23381	2025-03-28 17:14:21 +03:00
Pavel Emelyanov	9aa986a49a	snapshot-ctl: Remove unused snapshot-single-table method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-28 10:45:31 +03:00
Pavel Emelyanov	5162f75d0b	api: Snapshot all tables at once in scrub handler The handler walks the list of tables and snapshots each one individually (if needed). That's not very optimal, each such call starts a "snapshot modification operation", which is switching to shard-0 for a lock, then calls the snapshot of multiple tables giving it vector of a single name. There's a method of snapshot-ctl that snapshots several tables at once, no need to open-code it here. One thing to care about -- the take_column_family_snapshot() throws when the vector of table names is empty, so need an explicit skipping check. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-28 10:44:47 +03:00
Avi Kivity	6d7cb68aab	test: ldap: avoid io_uring Seastar reactor backend It tends to fail sometimes with ENOMEM: ``` ERROR 2025-03-24 01:05:22,983 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found) ERROR 2025-03-24 01:05:30,984 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found) ERROR 2025-03-24 01:05:47,123 [shard 0:main] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:12, Cannot allocate memory) ERROR 2025-03-24 01:05:47,139 [shard 0:main] table - failed to write sstable /scylladir/testlog/x86_64/debug/scylla-33787f64/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa/me-3got_1s5n_0lfls1y4z7vkkts07a-big-Data.db: storage_io_error (Storage I/O error: 12: Cannot allocate memory) ERROR 2025-03-24 01:05:47,140 [shard 0:main] table - Memtable flush failed due to: storage_io_error (Storage I/O error: 12: Cannot allocate memory). 
Aborting, at 0x30f5605 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514f14 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514b96 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x45165b1 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4518dcf 0x3fde842 0x35dc5c6 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674 0x314b8b6 /lib64/libc.so.6+0x70ba7 /lib64/libc.so.6+0xf4b8b -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> 
(seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> -------- seastar::shared_future<>::shared_state Aborting on shard 0, in scheduling group main. Backtrace: 0x30f5605 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x384a0e4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x3849db2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x369bd84 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d42a2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a5ed9 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a61d5 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a601f /lib64/libc.so.6+0x1a04f /lib64/libc.so.6+0x72b53 /lib64/libc.so.6+0x19f9d /lib64/libc.so.6+0x1941 0x3fde8b1 0x35dc5c6 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2 
/jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674 0x314b8b6 /lib64/libc.so.6+0x70ba7 /lib64/libc.so.6+0xf4b8b === TEST.PY SUMMARY START === Test exited with code -6 === TEST.PY SUMMARY END === === decoded === Backtrace: [Backtrace #0] __interceptor_backtrace at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4369 void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/debug/seastar/./seastar/include/seastar/util/backtrace.hh:70 seastar::backtrace_buffer::append_backtrace() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:805 seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:838 seastar::print_with_backtrace(char const, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:850 seastar::sigabrt_action() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:4004 seastar::install_oneshot_signal_handler<6, (void ()())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t, void)#1}::operator()(int, siginfo_t, void) const at 
./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3981 seastar::install_oneshot_signal_handler<6, (void ()())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t, void)#1}::__invoke(int, siginfo_t, void) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3976 /lib64/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=c8c3fa52aaee3f5d73b6fd862e39e9d4c010b6ba, for GNU/Linux 3.2.0, not stripped ?? ??:0 printf_positional at ??:? ?? ??:0 ?? ??:0 replica::table::seal_active_memtable(replica::compaction_group&, replica::flush_permit&&)::$_0::operator()(std::function<seastar::future<void> ()>) const at ././replica/table.cc:1512 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:122 seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:2616 seastar::reactor::run_some_tasks() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3088 seastar::reactor::do_run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3256 seastar::reactor::run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3146 seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:276 seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:167 seastar::testing::test_runner::start_thread(int, char)::$_0::operator()() at 
./build/debug/seastar/./build/debug/seastar/./seastar/src/testing/test_runner.cc:77 void std::__invoke_impl<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>(std::__invoke_other, seastar::testing::test_runner::start_thread(int, char)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 std::enable_if<is_invocable_r_v<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>, void>::type std::__invoke_r<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>(seastar::testing::test_runner::start_thread(int, char)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111 std::_Function_handler<void (), seastar::testing::test_runner::start_thread(int, char)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 seastar::posix_thread::start_routine(void) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/posix.cc:90 asan_thread_start(void*) at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/asan_interceptors.cpp:239 __vfscanf_internal at :? peek_token at ??:? ``` In `ce65164315`, we banned io_uring from tests, but missed the ldap tests. This extends coverage to ldap tests. I verified that the new options indeed reach the test. Refs #23411. Credit to Botond for recognizing the failure reason. Closes scylladb/scylladb#23422	2025-03-28 07:45:53 +02:00
Tomasz Grabiec	d6232a4f5f	tablets: load_balancer: Fix reporting of total load per node Load is now utilization, not count, so we should report average per-shard load, which is equivalent to node's utilization.	2025-03-27 23:28:20 +01:00
Botond Dénes	bd8973a025	tools/scylla-nodetool: s/GetInt()/GetInt64()/ GetInt() was observed to fail when the integer JSON value overflows the int32_t type, which `GetInt()` uses for storage. When this happens, rapidjson will assign a distinct 64 bit integer type to the value, and attempting to access it as a 32 bit integer triggers the wrong-type error, resulting in an assert failure. This was hit in the field, where invoking nodetool netstats resulted in nodetool crashing when the streamed bytes amounts were higher than maxint. To avoid such bugs in the future, replace all usage of GetInt() in nodetool with GetInt64(), just to be sure. A reproducer is added for the nodetool netstats crash. Fixes: scylladb/scylladb#23394 Closes scylladb/scylladb#23395	2025-03-27 14:05:39 +02:00
Botond Dénes	d57e71837f	Merge 'Improve scoped restore test' from Pavel Emelyanov This PR includes several fixes to the currently flaky test_restore_with_streaming_scopes test. 1. Check that backup and restore APIs don't fail. Currently, if either of them does, the test case fails anyway at the check that the data is restored back, but it's better to know what exactly failed 2. For the restore API the test collects the list of sstables to restore from. Currently, collecting this list races with background compaction and sometimes causes the restore API to fail, which in turn makes the whole test fail 3. Add a test case that validates that restore-from-missing-sstable fails nicely refs: #23189 No backport, as it's a relatively new test Closes scylladb/scylladb#23445 * github.com:scylladb/scylladb: test/backup: Validate that restoring from non-existing sstables fails test/backup: Collect sstables names after snapshot test/backup: Check that backup and restore succeed	2025-03-27 13:23:41 +02:00
Piotr Dulikowski	288216a89e	Merge 'Ignore wrapped exceptions `gate_closed_exception` and `rpc::closed_error` when node shuts down.' from Sergey Zolotukhin Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325 Fixes scylladb/scylladb#23305 Fixes scylladb/scylladb#21815 Backport: looks like this is quite a frequent issue, therefore backport to 2025.1. Closes scylladb/scylladb#23336 * github.com:scylladb/scylladb: database: Pass schema_ptr as const ref in `wrap_commitlog_add_error` database: Unify exception handling in `do_apply` and `apply_with_commitlog` storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-27 11:39:42 +01:00
Pavel Emelyanov	9f036d957a	Merge 'test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables' from Botond Dénes Filter out sstables which don't have a TOC or have a temporary TOC. Such sstables are incomplete and can disappear if the compaction which writes them is interrupted. Fixes: #23203 This PR fixes a flaky test which is only on master, no backports required. Closes scylladb/scylladb#23450 * github.com:scylladb/scylladb: test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables	2025-03-27 09:45:07 +03:00
Tomasz Grabiec	8e506c5a8f	test: tablets: Fix flakiness due to ungraceful shutdown The test fails sporadically with: cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1} That's because a server is stopped in the middle of the workload. The server is stopped ungracefully, which will cause some requests to time out. We should stop it gracefully to allow in-flight requests to finish. Fixes #20492 Closes scylladb/scylladb#23451	2025-03-27 09:44:07 +03:00
Lakshmi Narayanan Sreethar	dccce670c1	topology_coordinator: fix indentation in generate_migration_updates Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar	5b47d84399	topology_coordinator: do not schedule migrations when there are pending resize finalizations Resize finalization is executed in a separate topology transition state, `tablet_resize_finalization`, to ensure it does not overlap with tablet transitions. The topology transitions into the `tablet_resize_finalization` state only when no tablet migrations are scheduled or being executed. If there is a large load-balancing backlog, split finalization might be delayed indefinitely, leaving the tables with large tablets. To fix this, do not schedule tablet migrations on any tables when there are pending resize finalizations. This ensures that migrations from the same table and other unrelated tables do not block resize finalization. Also added a testcase to verify the fix. Fixes #21762 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar	8cabc66f07	load_balancer: make repair plans only when there is no pending resize finalization Do not make repair plans if any table has pending resize finalization. This is to ensure that the finalization doesn't get delayed by repair tasks. Refs #21762 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-03-27 10:16:34 +05:30
Avi Kivity	b292b5800b	Merge 'test.py: move starting LDAP service to dedicate method' from Andrei Chekun Move starting LDAP to the method where the rest of the services are started. This will unify the way of starting the 3rd party services. Fix LDAP test flakiness caused by failures to connect to the LDAP server. Capture stdout and stderr of toxiproxy-cli in case of errors. Related: https://github.com/scylladb/scylladb/pull/23333 This PR is based on https://github.com/scylladb/scylladb/pull/23221, so #23221 should be merged first. Closes scylladb/scylladb#23235 * github.com:scylladb/scylladb: test.py: Refactor nodetool/conftest test.py: Refactor test/pylib/cpp/ldap test.py: move starting LDAP service to dedicate method	2025-03-26 15:31:00 +02:00
Botond Dénes	801339bad9	test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context To just system.local, the table these tests operate on. No need to disable autocompaction for all of the system keyspace.	2025-03-26 09:19:38 -04:00
Botond Dénes	3ec863c4ce	test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables Filter out sstables which don't have a TOC or have a temporary TOC. Such sstables are incomplete and can disappear if the compaction which writes them is interrupted.	2025-03-26 09:18:34 -04:00
Pavel Emelyanov	1da889f239	Merge 'Allow abort during join_cluster' from Benny Halevy Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 * requires backport on top of https://github.com/scylladb/scylladb/pull/23184 Closes scylladb/scylladb#23306 * github.com:scylladb/scylladb: main: allow abort during join_cluster main: add checkpoint before joining cluster storage_service: add start_sys_dist_ks	2025-03-26 15:48:58 +03:00
Sergey Zolotukhin	d448f3de77	database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`	2025-03-26 11:15:26 +01:00
Sergey Zolotukhin	0d9d0fe60e	database: Unify exception handling in `do_apply` and `apply_with_commitlog` Move exception wrapping logic from `do_apply` and `apply_with_commitlog` to `wrap_commitlog_add_error` to ensure consistent error handling.	2025-03-26 11:15:18 +01:00
Sergey Zolotukhin	b1e89246d4	storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325	2025-03-26 11:15:16 +01:00
Sergey Zolotukhin	6abfed9817	exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-26 11:15:13 +01:00
Evgeniy Naydanov	574c81eac6	test.py: random_failures: deselect topology ops for some injections After recent changes #18640 and #19151 started to reproduce for stop_after_sending_join_node_request and stop_after_bootstrapping_initial_raft_configuration error injections too. The solution is the same: deselect the tests. Fixes #23302 Closes scylladb/scylladb#23405	2025-03-26 12:07:12 +03:00
Pavel Emelyanov	38f37763d6	test/backup: Validate that restoring from non-existing sstables fails When restore API is called and is given a non-existing sstable (object name) the task should complete with failed status and some meaningful message in the error text. refs: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-26 10:55:42 +03:00
Pavel Emelyanov	02610a9072	test/backup: Collect sstables names after snapshot The scoped restore test works like this - populate table - flush it - collect list of sstables - take snapshot - backup - restore (with the list of sstables as argument) - check the data is back Steps 2 and 3 are racy -- in case compaction comes in the middle, the list of collected sstables would differ from those snapshotted (and backed up) which will later lead to restore failure due to missing sstable. Fix by collecting the list of sstables after taking snapshot, and collect those not from the datadir, but from the snapshot dir. fixes: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-26 10:40:54 +03:00
Pavel Emelyanov	08004fe470	test/backup: Check that backup and restore succeed The scoped-restore test calls backup and restore APIs on several nodes, but doesn't check if any of the operations actually succeeds. Sometimes they indeed don't and test captures this, but in a weird manner -- the post-test check for data presence fails, because the expected data is not in fact in its place. It's more debugging-friendly if we know in advance if backup or restore fails, rather than see that some data is missing after (failed) restore. refs: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-25 19:45:56 +03:00
Gleb Natapov	0aa4a82c83	messaging_service: do not call uninitialized _address_to_host_id_mapper std::function During messaging_service object creation remove_rpc_client function may be called if prefer_local snitch setting is true. The caller does not provide host id, so _address_to_host_id_mapper is called to obtain it, but at this point the function is not initialized yet. The patch fixes the code to not call the function if not initialized. This is not a problem since during messaging_service creation there is no connection to drop. Fixes: #23353 Message-ID: <Z-J2KbBK8NoFNYZZ@scylladb.com>	2025-03-25 18:41:16 +02:00
Wojciech Mitros	88d3fc68b5	alter_table_statement: fix renaming multiple columns in tables with views When we rename columns in a table which has materialized views depending on it, we need to also rename them in the materialized views' WHERE clauses. Currently, we do that by creating a new WHERE clause after each rename, with the updated column. This is later converted to a mutation that overwrites the WHERE clause. After multiple renames, we have multiple mutations, each overwriting the WHERE clause with one column renamed. As a result, the final WHERE clause is one of the modified clauses with one column renamed. Instead, we should prepare one new WHERE clause which includes all the renamed columns. This patch accomplishes this by processing all the column renames first, and only preparing the new view schema with the new WHERE clause afterwards. This patch also includes a test reproducer for this scenario. Fixes scylladb/scylladb#22194 Closes scylladb/scylladb#23152	2025-03-25 09:58:58 +01:00
Benny Halevy	9fac0045d1	boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:39:53 +02:00
Benny Halevy	62aeba759b	tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow to opt-out when creating new keyspaces by setting `tablets = {'enabled': false}`. Refs scylladb/scylla-enterprise#4355 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:32:16 +02:00
Benny Halevy	c62865df90	db/config: add tablets_mode_for_new_keyspaces option The new option deprecates the existing `enable_tablets` option. It will be extended in the next patch with a 3rd value: "enforced" which will enable tablets by default for new keyspaces but without the possibility to opt out using the `tablets = {'enabled': false}` keyspace schema option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 14:54:45 +02:00
Michael Litvak	49b8cf2d1d	storage_service: fix tablet split of materialized views This fixes an issue where materialized view tablets are not split because they are not registered as split candidates by the storage service. The code in storage_service::replicate_to_all_cores was changed in `4bfa3060d0` to handle normal tables and view tables separately, but with that change register_tablet_split_candidate is applied only to normal tables and not every table like before. We fix it by registering view tables as well. We add a test to verify that split of MV tables works. Closes scylladb/scylladb#23335	2025-03-24 08:23:58 +01:00
Pavel Emelyanov	79b9626d16	Merge 'service: do not include unused headers ' from Kefu Chai these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. also, updated the "iwyu.yaml" (short for include what you use) workflow to include "service" and "raft" subdirectories to prevent future regressions of including unused headers in them. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#23373 * github.com:scylladb/scylladb: .github: add "raft" and "service" subdirectories to CLEANER_DIR service: do not include unused headers	2025-03-24 10:20:15 +03:00
Avi Kivity	cc5fe542ed	test: ignore unused fmt::to_string() result fmt 11.1 apparently marks to_string() as [[nodiscard]]. Here we aren't interested in the result, so explicitly ignore it to avoid an error. Closes scylladb/scylladb#23403	2025-03-24 10:19:09 +03:00
Avi Kivity	9d49c3254f	install-dependencies.sh: disambiguate python magic package There are in fact two python magic packages, file-magic (that binds to libmagic and comes from the file package), and magic, an independent one. The name we use in install-dependencies.sh, python3-magic, resolves to file-magic. In Fedora 42, the resolution from the name python3-magic to file-magic was removed [1], and so install-dependencies.sh now tries to install the wrong magic package, which turns out not to coexist with the one we want anyway. Fix by naming python3-file-magic directly instead. Since this is what's installed in the current frozen toolchain, there's no need to regenerate it; we're just making the package list work in Fedora 42. [1] `81910b7d88` Closes scylladb/scylladb#23402	2025-03-24 10:18:27 +03:00
Avi Kivity	cd04ab1a4e	test: avoid spaces when defining user-defined literal operator Clang 20 complains when it sees a user-defined literal operator defined with a space before the underscore. Assume it's adhering to the standard and comply. Closes scylladb/scylladb#23401	2025-03-24 10:17:12 +03:00
Pavel Emelyanov	d436fb8045	Merge 'Fix EAR not applied on write to S3 (but on read).' from Calle Wilund Fixes #23225 Fixes #23185 Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves extension wrapping of file and sink objects to storage level. (Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not). This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail. Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR. Unit tests for io objects and a macro test for S3 encrypted storage included. Closes scylladb/scylladb#23261 * github.com:scylladb/scylladb: encryption: Add "wrap_sink" to encryption sstable extension encrypted_file_impl: Add encrypted_data_sink sstables::storage: Move wrapping sstable components to storage provider sstables::file_io_extension: Add a "wrap_sink" method. sstables::file_io_extension: Make sstable argument to "wrap" const utils: Add "io-wrappers", useful IO helper types	2025-03-24 10:12:46 +03:00
Artsiom Mishuta	8bb6414037	test.py: reuse clusters in Python suite PR https://github.com/scylladb/scylladb/pull/22274 was introduced due to CI instability, in order to mark the cluster dirty after each test for topology. In fact, it affects only Python suites, which are quite stable, and CI was stabilized by PR https://github.com/scylladb/scylladb/pull/22252. This PR brings back cluster reuse in Python test suites. Closes scylladb/scylladb#23179	2025-03-23 20:08:36 +02:00
Kefu Chai	fdc5255eb8	build: disable DPDK for all release builds Previously, DPDK was enabled by default in standard release builds but disabled in "release-pgo" and "release-cs-pgo" builds. This inconsistency caused linking warnings during PGO phase 2, when trained profiles from non-DPDK builds were used with DPDK-enabled builds: ``` [1980/1983] LINK build/release/scylla ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor14run_some_tasksEv Hash = 2095857468992035112 up to 0 count discarded ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor6do_runEv Hash = 2184396189398169723 up to 50134372 count discarded ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar18syscall_work_queue11submit_itemESt10unique_ptrINS0_9work_itemESt14default_deleteIS2_EE Hash = 1533150042646546219 up to 1979931 count discarded ``` Since DPDK is not used in production and increases build time, this change disables DPDK across all release build types. This both silences the warnings and improves build performance. Fixes #23323 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23391	2025-03-23 15:26:10 +02:00
Avi Kivity	9adfb91f46	Merge 'Introduce s3 data_source_impl for optimized object streaming' from Pavel Emelyanov Currently, to stream data from sstable component the sstables code uses file_data_source_impl. In case the component is on S3, the s3::readable_file is put into that data source. The data source is configured with 128k buffers and at most 4 read-ahead-s. With that configuration, downloading full object from S3 becomes too slow -- GET-ing file with 128k requests is not nice even with 4 parallel read-ahead-s. Better solution for S3 downloading is to request way larger chunk with one GET and then produce smaller, 128k or alike, buffers upon data arrival. This is what the newly introduced data source impl does -- it spawns a background GET and lets the upper input stream read buffers directly from the arriving body. This PR doesn't yet make sstable layer use the new sink, just introduces it and adds unit and perf tests. Testing \|Test\|Download speed, MB/s\| \|-\|-\| \|file_input_stream (), 1 socket \| 4.996\| \|file_input_stream (), 2 sockets \| 9.403\| \|s3_data_source (*) \| 93.164\| () The file_input_stream test renders 128k GETs and is configured to issue at most 4 read-ahead-s (*) The s3_data_source uses at most 1 socket regardless of what perf-test configures it to refs: #22458 Closes scylladb/scylladb#22907 github.com:scylladb/scylladb: test: Extend s3-perf test with stream download one test/perf: Tune-up s3 test options parsing test: Add unit test for newly introduced download source s3/client: Introduce data_source_impl for object downloading s3/client: Detach format_range_header() helper	2025-03-23 14:22:04 +02:00
Pavel Emelyanov	ca3b604afa	test: Extend s3-perf test with stream download one Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:07 +03:00
Pavel Emelyanov	283e8e0706	test/perf: Tune-up s3 test options parsing Rename the `--upload bool` into `--operation string` one, so that new tests can be added in the future. Also rename run_download() to run_contiguous_get() because this is what the internals of this method do -- just GET contiguous ranges sequentially. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:07 +03:00
Pavel Emelyanov	bd313c581f	test: Add unit test for newly introduced download source Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:06 +03:00
Pavel Emelyanov	1f301b1c5d	s3/client: Introduce data_source_impl for object downloading The new data source implementation runs a single GET for the whole range specified and lends the body input_stream for the upper input_stream's get()-s. Eventually, getting the data from the body stream EOFs or fails. In either case, the existing body is closed and a new GET is spawned with the updated Range header so as not to include the bytes read so far. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:06 +03:00
Pavel Emelyanov	d47719f70e	s3/client: Detach format_range_header() helper The get_object_contiguous() formats the 'bytes=X-Y' one for its GET request. The very same code will be needed by next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:06 +03:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Paweł Zakrzewski	0d14177409	audit/syslog: escape quotes and add explicit section names Before this change we outputted CSV-like structure, that looked like the following: Feb 27 12:31:30 scylla-audit: "10.200.200.41:0", "AUTH", "", "", "", "", "10.200.200.41:0", "cassandra", "false" While this is passably readable for humans, the ordering of fields is not clear and can be confusing. Furthermore, the `"` character (double quote) was not escaped. This is not an issue for CQL, but will be a problem for auditing Alternator, which will require logging JSON payloads. The new format will consist of key=value pairs and will escape the quote character, making it easy to parse programmatically. Feb 28 02:21:56 scylla-audit: node="10.200.200.41:0", category="AUTH", cl="", error="false", keyspace="", query="", client_ip="10.200.200.41:0", table="", username="cassandra" This is required for the auditing alternator feature. Closes scylladb/scylladb#23099	2025-03-20 19:55:51 +03:00
Calle Wilund	5c6337b887	encryption: Add "wrap_sink" to encryption sstable extension Creates a more efficient data_sink wrapper for encrypted output stream (S3).	2025-03-20 14:54:24 +00:00
Calle Wilund	9ac9813c62	encrypted_file_impl: Add encrypted_data_sink Adds a sibling type to encrypted file, a data_sink, that will write a data stream in the same block format as a file object would. Including end padding. For making encrypted data sink writing less cumbersome.	2025-03-20 14:54:24 +00:00
Calle Wilund	e02be77af7	sstables::storage: Move wrapping sstable components to storage provider Fixes #23225 Fixes #23185 Moved wrapping component files/sinks to storage provider. Also ensures to wrap data_sinks as well as actual files. This ensures that we actually write encryption if active.	2025-03-20 14:54:24 +00:00
Calle Wilund	d46dcbb769	sstables::file_io_extension: Add a "wrap_sink" method. Similar to wrap file, should wrap a data_sink (used for sstable writers), in obvious write-only, simple stream mode. Default impl will detect if we wrap files for this component, and if so, generate a file wrapper for the input sink, wrap this, and then wrap it in a file_data_sink_impl. This is obviously not efficient, so extensions used in actual non-test code should implement the method.	2025-03-20 14:54:22 +00:00
Calle Wilund	e100af5280	sstables::file_io_extension: Make sstable argument to "wrap" const This matches the signature of call sites. Since the only "real" extension to actually make a marker in the sstable will do so in the scylla component, which is writable even in a const sstable, this is ok.	2025-03-20 14:54:09 +00:00
Calle Wilund	98a6d0f79c	utils: Add "io-wrappers", useful IO helper types Mainly to add a somewhat functional file-impl wrapping a data_sink. This can implement a rudimentary, write-only, file based on any output sink. For testing, and because they fit there, place memory sink and source types there as well.	2025-03-20 14:54:09 +00:00
David Garcia	209ea2ea27	docs: update issues label Closes scylladb/scylladb#23304	2025-03-20 17:46:58 +03:00
Kefu Chai	c37149d106	test: stop using seastar::at_exit() seastar::at_exit() was marked deprecated recently. so let's use the recommended approach to perform cleanups. the following tests were updated in this change - scylla perf-tablets: tested with scylla perf-tablets - scylla perf-row-cache-update: tested with scylla perf-row-cache-update - scylla perf-fast-forward: tested with scylla perf-fast-forward --populate --run-tests small-partition-skips \ --smp 1 scylla perf-fast-forward --run-tests small-partition-skips \ --smp 1 - scylla perf-load-balancing: tested with scylla perf-load-balancing --nodes 3 --tablets1 16 --tablets2 16 --rf1 3 --rf2 3 --shards 16 - unit/row_cache_stress_test: tested with row_cache_stress_test --seconds 10 - perf/perf_cache_eviction: tested with ./perf_cache_eviction --seconds 1 --smp 1 - perf/perf_row_cache_reads: tested with ./perf_row_cache_reads Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23356	2025-03-20 17:44:57 +03:00
Ernest Zaslavsky	2fb5c7402e	s3_client: Rearrange credentials providers chain As the IAM role is not configured to assume a role at this moment, it makes sense to move the instance metadata credentials provider up in the chain. This avoids unnecessary network calls and prevents log clutter caused by failure messages. Closes scylladb/scylladb#23360	2025-03-20 17:43:04 +03:00
Pavel Emelyanov	23089e1387	Merge 'Enhance S3 client robustness' from Ernest Zaslavsky This PR introduces several key improvements to bolster the reliability of our S3 client, particularly in handling intermittent authentication and TLS-related issues. The changes include: 1. Automatic Credential Renewal and Request Retry: When credentials expire, the new retry strategy now resets the credentials and sets the client to the retryable state, so the client will re-authenticate, and automatically retry the request. This change prevents transient authentication failures from propagating as fatal errors. 2. Enhanced Exception Unwrapping: The client now extracts the embedded std::system_error from std::nested_exception instances that may be raised by the Seastar HTTP client when using TLS. This allows for more precise error reporting and handling. 3. Expanded TLS Error Handling: We've added support for retryable TLS error codes within the std::system_error handler. This modification enables the client to detect and recover from transient TLS issues by retrying the affected operations. Together, these enhancements improve overall client robustness by ensuring smoother recovery from both credential and TLS-related errors. No backport needed since it is an enhancement Closes scylladb/scylladb#22150 * github.com:scylladb/scylladb: aws_error: Add GNU TLS codes s3_client: Handle nested std::system_error exceptions s3_client: Start using new retry strategy retry_strategy: Add custom retry strategy for S3 client retry_strategy: Make `should_retry` awaitable	2025-03-20 16:52:20 +03:00
Andrei Chekun	502b31d9c2	test.py: Refactor nodetool/conftest Remove using method for finding root dir of the project and start using the constant defined in package.	2025-03-20 11:41:30 +01:00
Andrei Chekun	1ea7b99385	test.py: Refactor test/pylib/cpp/ldap Rename and move prepare_instance from ldap tests directory to pylib/ldap_server.	2025-03-20 11:41:30 +01:00
Andrei Chekun	33e53565c4	test.py: move starting LDAP service to dedicate method Move starting LDAP to the method where the rest of the services are started. This will unify the way of starting the 3rd party services. Fix LDAP test flakiness caused by failures to connect to the LDAP server. Add catching stdout and stderr of toxiproxy-cli in case of errors.	2025-03-20 11:37:04 +01:00
Pavel Emelyanov	339a849f13	transport: Remove connection::make_client_key() It's effectively unused, there's one place where connection initializes the client_data object using this helper, but that initialization looks better without it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23321	2025-03-20 10:22:05 +01:00
Calle Wilund	5cc3fc4f14	cluster/test_encryption: bring test from enterprise (and enable) Fixes scylladb/scylla-enterprise#5262 Part of the source-available code migration from scylla-enterprise.git to scylla.git. Original comment: topology_custom: add test_file_streaming_respects_encryption Reproducer for issue scylladb/scylla-enterprise#4246. Closes scylladb/scylladb#23320	2025-03-20 10:07:16 +02:00
Kefu Chai	ebf9125728	storage_proxy: Prevent integer overflow in abstract_read_executor::execute Fix UBSan abort caused by integer overflow when calculating time difference between read and write operations. The issue occurs when: 1. The queried partition on replicas is not purgeable (has no recorded modified time) 2. Digests don't match across replicas 3. The system attempts to calculate timespan using missing/negative last_modified timestamps This change skips cross-DC repair optimization when write timestamp is negative or missing, as this optimization is only relevant for reads occurring within write_timeout of a write. Error details: ``` service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long') SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80 Aborting on shard 1, in scheduling group sl:default ``` Related to previous fix `39325cf` which handled negative read_timestamp cases. Fixes #23314 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23359	2025-03-20 10:05:42 +02:00
Botond Dénes	d06bc27979	Merge 'Don't export string filenames from sstable' from Pavel Emelyanov There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print them into logs, sometimes are used to be put into exception messages. And there are places that use these strings as file names. Since now sstables can also be stored on S3, generic code shouldn't consider those strings as on disk file names. Other than that, even when the methods are used to put component names into logs, in many cases these log messages come with debug or trace level, so generated strings are immediately dropped on the floor, but generating it is not extremely cheap. Code would benefit from using lazily-printed names. This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit. refs: #14122 (previous ugly attempt to achieve the same goal) Closes scylladb/scylladb#23194 * github.com:scylladb/scylladb: sstable: Remove unused malformed_sstable_exctpion(string filename) sstables: Make filename() return component_name sstables: Make file_writer keep component_name on board sstables: Make get_filename() return component_name sstables: Make toc_filename() return component_name sstables: Make sstable::index_filename() return component_name sstables: Introduce struct component_name sstables: Remove unused sstable::component_filenames() method sstables: Do not print component filenames on load-and-stream wrap-up sstables: Explicitly format prefix in S3 object name making sstables: Don't include directory name in exception sstables: Use fmt::format instead of string concatenation sstables: Rename filename($component) calls to ${component}_filename() sstables: Rename local filename variable to component_name	2025-03-20 09:51:03 +02:00
Kefu Chai	fd14a23aab	.github: add "raft" and "service" subdirectories to CLEANER_DIR in order to prevent future inclusion of unused headers, let's include "raft" and "service" subdirectories to CLEANER_DIR, so that this workflow can identify the regressions in future. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-20 11:18:16 +08:00
Kefu Chai	b3e2561ed8	service: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-20 11:18:16 +08:00
Avi Kivity	a62ab824e6	schema: deprecate schema_extension schema_extension allows making invisible changes to system_schema that evade upgrade rollback tests. They appear in system_schema as an encoded blob which reduces serviceability, as they cannot be read. Deprecate it and point users to adding explicit columns in scylla_tables. We could probably make use of the data structure, after we teach it to encode its payload into proper named and typed columns instead of using IDL. Closes scylladb/scylladb#23151	2025-03-19 20:36:16 +02:00
Kefu Chai	8fdaaf6491	service/storage_proxy: Improve digest comparison Previously, the code used a find_if to compare each digest to the first one to check for any mismatches. This was less readable. This change replaces that with `std::ranges::all_of`, which checks if all elements in the range are equal to the first digest, improving readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23332	2025-03-19 18:21:14 +03:00
Nadav Har'El	317de64281	test/alternator: enable debugging output during Python crashes For a long time now, we've been seeing (see #17564), once in a while, Alternator tests crashing with the Python process getting killed on SIGSEGV after the tests have already finished successfully and all pytest had to do is exit. We have not been able to figure out where the bug is. Unfortunately, we've never been able to reproduce this bug locally - and only rarely we see it in CI runs, and when it happens we don't have any information on why it happened. So the goal of this patch is to print more information that might hopefully help us next time we see this problem in CI (this patch does NOT fix the bug). This patch adds to test/alternator's conftest.py a call to faulthandler.enable(). This traps SIGSEGV and prints a stack trace (for each thread, if there are several) showing what Python was trying to do while it is crashing. Hopefully we'll see in this output some specific cleanup function belonging to boto3 or urllib or whatever, and be able to figure out where the bug is and how to avoid it. We could have added this faulthandler.enable() call to the top-level conftest.py or to test.py, but since we only ever had this Python crash in Alternator tests, I think it is more suitable that we limit this desperate debugging attempt only to Alternator tests. Refs #17564 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23340	2025-03-19 18:18:51 +03:00
Dawid Mędrek	0e04a6f3eb	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300	2025-03-19 15:13:44 +01:00
Dawid Mędrek	41f862d7ba	cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces In this commit, we refuse to create or alter a keyspace when that operation would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is enabled. We provide two tests verifying that the changes work as intended. Fixes scylladb/scylladb#23276	2025-03-19 14:51:47 +01:00
Dawid Mędrek	32879ec0d5	db/config: Introduce RF-rack-valid keyspaces We introduce a new term in the glossary: RF-rack-valid keyspace. We also highlight in our user documentation that all keyspaces must remain RF-rack-valid throughout their lifetime, and failing to guarantee that may result in data inconsistencies or other issues. We base that information on our experience with materialized views in keyspaces using tablets, even though they remain an experimental feature. Along with the new term, we introduce a new configuration option called `rf_rack_valid_keyspaces`, which, when enabled, will enforce preserving all keyspaces RF-rack-valid. That functionality will be implemented in upcoming commits. For now, we materialize the restriction in form of a named requirement: a function verifying that the passed keyspace is RF-rack-valid. The option is disabled by default. That will change once we adjust the existing tests to the new semantics. Once that is done, the option will first be enabled by default, and then it will be removed. Fixes scylladb/scylladb#20356	2025-03-19 14:46:35 +01:00
Pavel Emelyanov	6e7d6b06f0	api: Squash two parse_table_infos into one There are currently three of them: - one that works on query parameter value - one that works on query parameters map - one that works on the request itself The second one is not used any longer by anyone but the third one, so squash them together. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:53:38 +03:00
Pavel Emelyanov	851bd38953	api: Generalize keyspaces:tables parsing a little bit more Continuation of the previous patch -- there's one caller that uses "non standard" name for the tables query parameter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:52:54 +03:00
Pavel Emelyanov	dc3455bc55	api: Provide general pair<keyspace, vector<table>> parsing Lots of API handlers get "keyspace" path parameter and parse the "cf" query one into a vector of table_infos. Generalize those places. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:51:57 +03:00
Pavel Emelyanov	722f282748	api: Remove ks_cf_func and related code The type in question is used by two endpoint handlers that are called with validated keyspace name and parsed vector of table_info-s. Both handlers can parse what they need on their own, all the more so next patches will make this parsing even simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:49:55 +03:00
Pavel Emelyanov	73187a2e19	Merge 'mutation/mutation_consumer_concepts: simplify consumer hierarchy' from Botond Dénes The reader consumer concept hierarchy is a sprawling confusing jungle of deeply nested concepts. Looking at `FlattenedConsumer[V2]` -- the subject of this PR: this consumer is defined in terms of the `StreamedMutationConsumer[V2]` which in turn is defined in terms of the `FragmentConsumer[V2]`. This amount of nesting makes it really hard to see what a concept actually comes down to: made even more difficult by the fact that the concepts are scattered across two header files. In theory, this nesting allows for greater flexibility: some code can use a lower level concept directly while it can also serve as the basis for the higher level concepts. But the fact of the matter is that none of the lower level concepts are used directly, so we pay the price in hard-to-follow code for no benefit. This PR cuts down the complexity by folding up the entire hierarchy into the top-level `FlattenedConsumer[V2]` and `FlatteneConsumerReturning[V2]` concepts. Doing this immediately reveals just how similar the two major consumer concepts (`FlattenedConsumer[V2]` and `MutationFragmentConsumer[V2]`) supported by `mutation_reader` are. In a follow-up PR, we will attempt to unify the two. Refactoring, no backport needed. Closes scylladb/scylladb#23344 * github.com:scylladb/scylladb: mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2] mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2] test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/	2025-03-19 15:43:00 +03:00
Pavel Emelyanov	a408a7abe1	sstable: Remove unused malformed_sstable_exception(string filename) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	f06cc32812	sstables: Make filename() return component_name Similarly to toc_, index_ and data filenames, make the generic component name getter return back not string, but a wrapper object. Most of callers are log messages and exception generations. Other than that there are tests, filesystem storage driver and few more places in generic code who "know" that they work with real files, so make them use explicit fmt::to_string(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	68c41f0459	sstables: Make file_writer keep component_name on board The class in question is a wrapper around output_stream that writes, flushes and closes the stream in async context. For logging it also keeps the component filename on board, and now it's good time to patch it and keep the component_filename instead. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	1ba91e28cb	sstables: Make get_filename() return component_name Similarly to previous patches -- mostly the result is used as log argument. The remaining users include - scylla sstable tool that dumps component names to json output - API endpoint that returns component names to user - tests these are all good to explicitly convert component_names to strings. There are few more places that expect strings instead of component name objects. For now they also use fmt::to_string() explicitly, partially it will be fixed later, mostly -- as future follow-ups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	0cdeed858c	sstables: Make toc_filename() return component_name Most of the callers use the returned value as log message parameter, some construct malformed_sstable_exception that was prepared by previous patch. The remaining callers explicitly use fmt::to_string(), these are - pending deletion log creation - filesystem storage code - tests - stream-blob code that re-loads sstable All but the last one are OK to use string toc name, the last one is not very correct in its usage of toc_filename string, but it needs more care to be fixed properly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	80e0030613	sstables: Make sstable::index_filename() return component_name Most of the method callers use it as log parameter. There are few more places that push it to malformed_sstable_exception, which immediately converts it to string, so this patch makes the exception be constructed with the component_name either. And there's one more place that passes this string to file_writer constructor. For now, convert it to string explicitly, but next patches will fix that place to use pure component_name too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:01:23 +03:00
Pavel Emelyanov	dbb9ee15c1	sstables: Introduce struct component_name The structure wraps const reference to sstable and component_name value (it's an enum of several elements). It also has a formatter so that it can be directly printed in logs (main usage) as well as converted to strings (auxiliary and discouraged usage). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	aba400f5d9	sstables: Remove unused sstable::component_filenames() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	24e5c30cc8	sstables: Do not print component filenames on load-and-stream wrap-up When load-and-stream finishes it may call sstable::unlink() method to drop the loaded (and streamed) sstable. Before calling it, it prints a log message about its intention that includes component_filenames() vector. This log message is ugly in several ways. First, it prints only recognized components, while unlink() method unlinks all of them, so it's sort of misleading (it doesn't seem that anyone ever read this message IRL though) Next, that's the only place that is _that_ verbose about sstable unlinking. "Common" unlinking paths don't print that much info. Finally, the log message is emitted at debug level, so it hardly ever appears in any logs, but collecting several filenames takes time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	fb2bd91009	sstables: Explicitly format prefix in S3 object name making Sometimes a component object name looks like s3://bucket/prefix/component. For that the path formatting code formats bucket name with the result of sstable->filename() invocation. This patch changes it to format bucket name, prefix itself and sstable->component_filename(). The change is idempotent, as sstable::filename() just concatenates prefix with sstable::component_filename(). This change will help to remove the former method from sstable soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	f212b5efa9	sstables: Don't include directory name in exception When filesystem storage throws an exception about failure to create components hardlinks, it includes three paths into it -- source file name, destination file name and the directory name. The directory name is excessive, source file name already has it. Also, this change will make it possible to remove one of malformed_sstable_exception constructors soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	a8bc81eb3c	sstables: Use fmt::format instead of string concatenation There are some places that concatenate filenames with something else to get different filename (tool does it) or message for exception (read_toc() helper). This patch uses fmt::format() instead to facilitate future patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	dcc9167734	sstables: Rename filename($component) calls to ${component}_filename() There's a generic sstable::filename(component_type) method that returns a file name for the given component. For "popular" components, namely TOC, Data and Index there are dedicated sstable methods to get their names. Fix existing callers of the generic method to use the former. It's shorter, nicer and makes further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	e6898a8854	sstables: Rename local filename variable to component_name This is to be consistent with future changes and not to bloat them with extra renames Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:20 +03:00
Kefu Chai	1ab2b7e7a0	tree: fix misspellings these two misspellings were flagged by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23357	2025-03-19 09:13:20 +02:00
Botond Dénes	8f0d0daf53	Merge 'repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk Do not hold erm during repair of a tablet that is started with tablet repair scheduler. This way two different tablets can be repaired and migrated concurrently. The same tablet won't be migrated while being repaired as it is provided by topology coordinator. Use topology_guard to maintain safety. Fixes: https://github.com/scylladb/scylladb/issues/22408. Needs backport to 2025.1 that introduces the tablet repair scheduler. Closes scylladb/scylladb#22842 * github.com:scylladb/scylladb: test: add test to check concurrent tablets migration and repair repair: do not hold erm for repair scheduled by scheduler repair: get total rf based on current erm repair: make shard_repair_task_impl::erm private repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary repair: pass session_id to repair_writer_impl::create_writer repair: keep materialized topology guard in shard_repair_task_impl repair: pass session_id to repair_meta	2025-03-19 08:55:24 +02:00
Kefu Chai	aca00118fb	service: fix misspellings these misspellings were flagged by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23334	2025-03-18 22:21:45 +02:00
Piotr Dulikowski	2ca1c0b6f9	Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak This PR introduces the new Raft-based recovery procedure for group 0 majority loss. The Raft-based recovery procedure works with tablets. The old gossip-based recovery procedure does not because we have no code for tablet migrations after the gossip-based topology changes. The Raft-based procedure requires the Raft-based topology to be enabled in the cluster. If the Raft-based topology is not enabled, the gossip-based procedure must be used. We will be able to get rid of the gossip-based procedure when we make the Raft-based topology mandatory (we can do both in the same version, 2025.2 is the plan). Before we do it, we will have to keep both procedures and explain when each of them should be used. The idea behind the new procedure is to recreate group 0 without touching the topology structures. Once we create a new group 0, we can remove all dead nodes using the standard `removenode` and `replace` operations. For the procedure to be safe, we must ensure that each member of the new group 0 moves to the same initial group 0 state. Also, the only safe choice for the state is the latest persistent state available among the live nodes. The solution to the problem above is to ensure that the leader of the new group 0 (called the recovery leader) is one of the nodes with the latest state available. Other members will receive the snapshot from the recovery leader when they join the new group 0 and move to its state. Below is the shortened description of the new recovery procedure from the perspective of the administrator. For the full description, refer to the design document. 1. Find the set of live nodes. 2. Kill any live node that shouldn't be a member of the new group 0. 3. Ensure the full network connectivity between live nodes. 4. Rolling restart live nodes to ensure they are healthy and ready for recovery. 5. 
Check if some data could have been lost. If yes, restore it from backup after the recovery procedure. 6. Find the recovery leader (the node with the largest `group0_state_id`). 7. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the recovery leader on each live node. 9. Rolling restart all live nodes, but the recovery leader must be restarted first. 10. Remove all dead nodes using `removenode` or `replace`. 11. Unset `recovery_leader` on all nodes. 12. Delete data of the old group 0 from `system.raft`, `system.raft_snapshots`, and `system.raft_snapshot_config`. In the future, we could automate some of these steps or even introduce a tool that will do all (or most) of them by itself. For now, we are fine with a procedure that is reliable and simple enough. This PR makes using 2025.1 with tablets much safer. We want to backport it to 2025.1. We will also want to backport a few follow-ups. Fixes scylladb/scylladb#20657 Closes scylladb/scylladb#22286 * github.com:scylladb/scylladb: test: mark tests with the gossip-based recovery procedure test: add tests for the Raft-based recovery procedure test: topology: util: fix the tokens consistency check for left nodes test: topology: util: extend start_writes gossip: allow group 0 ID mismatch in the Raft-based recovery procedure raft_group0: modify_raft_voter_status: do not add new members treewide: allow recreating group 0 in the Raft-based recovery procedure	2025-03-18 19:10:56 +01:00
Yaron Kaikov	b375222408	./github/scripts/auto-backport.py: don't remove backport label when backport process has an error Today, when the `Fixes` prefix is missing or the developer is not a collaborator with `scylladbbot` we remove the backport labels to prevent the process from starting and notifying the developers. Developers are worried that removing these backport labels will cause us to forget we need to do these backports. @nyh suggested to add a `scylladbbot/backport_error` label instead Applied those changes, so when a `Fixes` prefix is missing we will add a `scylladbbot/backport_error` label and stop the process When a user doesn't accept the invite we will still open the PR but he will not be assigned and will not be able to edit the branch when we have conflicts Fixes: https://github.com/scylladb/scylla-pkg/issues/4898 Fixes: https://github.com/scylladb/scylla-pkg/issues/4897 Closes scylladb/scylladb#23259	2025-03-18 16:19:09 +02:00
Pavel Emelyanov	420b5bee20	test/s3: Increase boost/s3_test log levels When something goes wrong, it's impossible to find anything out without s3 and http logs, so increase them for boost tests. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23245	2025-03-18 15:59:05 +02:00
Botond Dénes	a2d0d7b9a0	mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2] FragmentConsumer[V2] also has no direct users, so fold it into FlattenedConsumer[V2] as well. With this, FlattenedConsumer[V2] has a nice and simple definition, with a single nesting level required due to the return-type flexibility.	2025-03-18 09:24:49 -04:00
Botond Dénes	8768e2e08e	mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2] No code uses StreamedMutationConsumer[V2] directly, so let's take this opportunity to reduce the jungle of consumer concepts.	2025-03-18 09:24:44 -04:00
Botond Dénes	969b07fdfd	test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/ The class actually implements the FlattenedConsumer, so fix the comment. This eliminates the only reference to the StreamedMutationConsumer concept.	2025-03-18 07:57:04 -04:00
Avi Kivity	9867129c7b	Update seastar submodule * seastar 412d058cf9...2f13c461bb (2): > smp: prefaulter: don't leave zombie worker threads Fixes #23316 > demos/tcp_sctp_server_demo: Modernize with seastar::async and proper teardown Closes scylladb/scylladb#23317	2025-03-18 13:36:05 +02:00
Botond Dénes	2795d83b32	Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is to simply give the segments to all CL shards, thus distributing the available space. Closes scylladb/scylladb#23150 * github.com:scylladb/scylladb: main/commitlog: wait for file deletion and distribute recycled segments to shards commitlog: Serialize file deletion	2025-03-18 11:47:17 +02:00
Avi Kivity	176bb464a2	github: error if we see #include "seastar/..." Seastar is a system library from ScyllaDB's perspective and so should use angle brackets for #include statements. Closes scylladb/scylladb#23308	2025-03-17 21:56:48 +02:00
Ernest Zaslavsky	08b9e4d87b	aws_error: Add GNU TLS codes Add GNU TLS error codes to std::system_error handler since we can start getting these once they seep from seastar's http client	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	012f0e6d8c	s3_client: Handle nested std::system_error exceptions Enhance error handling by detecting and processing std::system_error exceptions nested within std::nested_exception. This improvement ensures that system-level errors wrapped in the exception chain are properly caught and managed, leading to more robust error reporting and recovery.	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	367140a9c5	s3_client: Start using new retry strategy * Previously, token expiration was considered a fatal error. With this change, the `s3_client` uses a new retry strategy that tries to renew expired credentials * Added related test to the `s3_proxy`	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	ed09614c27	retry_strategy: Add custom retry strategy for S3 client Introduced a new retry strategy that extends the default implementation. The should_retry method is overridden to handle a specific case for expired credential tokens. When an expired token error is detected, the credentials are reset so it is expected that the client will re-authenticate, and the original request is retried.	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	26062c65e4	retry_strategy: Make `should_retry` awaitable	2025-03-17 16:36:26 +02:00
Avi Kivity	0e4b303339	tools: toolchain: regenerate for python3-pytest-asyncio 0.24 Fixes a bug related to load_scope="module". python-driver fixed to version 3.28.2, as it looks like 3.29.0 regressed TLS handling [1]. In any case tools/cqlsh fixes it to 3.28.2. Optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Ref #22960. Fixes #23213 [1] https://github.com/scylladb/python-driver/issues/456 Closes scylladb/scylladb#23236	2025-03-17 15:41:55 +02:00
Botond Dénes	fda3486770	Merge 'Remove some excessive ks:cf -> table_id conversions in API and schema_tables' from Pavel Emelyanov Actually, the main goal of this PR was to remove parse_tables() helpers from api/ in favor of more flexible (yet same complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps. There are several places in API and schema_tables that have table_id at hand, but at some point drop it and carry keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id with the help of table_info struct in those places. This change allows removing the aforementioned parse_table() helpers from api/ and also saves few lookups in database maps. Removing the parse_tables() from api/ is the continuation of previous effort that reduces the set of helpers in api/ code that help handlers "parse" keyspaces and tables names see #22742 #21533 Closes scylladb/scylladb#23216 * github.com:scylladb/scylladb: api: Remove the remaining parse_tables() overload database: Sanitize flush_tables_on_all_shards() schema_tables: Remove all_table_names() database: Make tables flushing helper use table_info-s, not names api: Make keyspace flush endpoint use parse_table_infos() (and a bit more) schema_tables,client_state: Switch to using all_table_infos() schema_tables: Tune up some methods to benefit from table_infos schema_tables: Introduce all_table_infos()	2025-03-17 15:40:41 +02:00
Pavel Emelyanov	6217124d1d	s3/client: Make "expected" reply status truly optional Currently when a client::make_request() is called it can pass std::optional<status> argument indicating which status it expects from server. In case status doesn't match, the request body handler won't be called, the request will fail with unexpected status exception. However, disengaged expected implicitly means that the requestor expects the OK (200) status. This makes it impossible to make a query whose return status is not known in advance and it's up to the handler to check it. Lower level http client allows disengaged expected with the described semantics -- handler will check the status on its own. This behavior for s3 client is needed for GET request. Server can respond with OK or partial content status depending on the Range header. If the header is absent or is large enough for the requested object to fit into it, the status would be OK, if the object is "trimmed" the status is partial content. At the end of the day, requestor cannot "guess" the returning status in advance and should check it upon response arrival. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23243	2025-03-17 15:34:58 +02:00
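The handler-side check described in the commit above can be sketched as follows. This is an illustration in Python rather than the C++ s3 client code; `classify_get_reply` is a hypothetical helper name, and only the OK / partial content distinction comes from the commit message.

```python
from http import HTTPStatus


def classify_get_reply(status: int) -> str:
    """Decide how to treat a GET reply when no expected status was given.

    Hypothetical helper: the caller cannot know in advance whether a ranged
    GET returns the whole object (200) or only a part of it (206), so the
    handler inspects the status itself once the response arrives.
    """
    if status == HTTPStatus.OK:
        return "full"
    if status == HTTPStatus.PARTIAL_CONTENT:
        return "partial"
    raise ValueError(f"unexpected status {status}")
```

For example, `classify_get_reply(206)` returns `"partial"`, while any status other than 200 or 206 raises `ValueError`.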
Botond Dénes	afa305ffb4	Merge 'perf/perf_sstable: stop using at_exit() ' from Kefu Chai `seastar::at_exit()` was marked deprecated recently. so let's use the recommended approach to perform cleanups. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#23253 * github.com:scylladb/scylladb: perf/perf_sstable: fix the indent perf/perf_sstable: stop using at_exit()	2025-03-17 15:30:10 +02:00
Andrei Chekun	d68e54c26d	test.py: Remove reuse cluster in cluster tests Pool is not aware of the cluster configuration, so it can return cluster to the test that is not suitable for it. Removing reuse will remove such possibility, so there will be less flaky tests. Closes scylladb/scylladb#23277	2025-03-17 15:27:59 +02:00
Calle Wilund	1525cb2dba	main/commitlog: wait for file deletion and distribute recycled segments to shards Refs #23017 When replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is to simply give the segments to all CL shards, thus distributing the available space. v2: * Do segment distribution using ranges. go c++23	2025-03-17 12:09:00 +00:00
Calle Wilund	4ed81e05bf	commitlog: Serialize file deletion Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. Small unit test included.	2025-03-17 12:09:00 +00:00
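The general idea of serializing deletion passes can be sketched with an asyncio lock. This is a Python analogy, not the Seastar-based commitlog code: `SegmentManager`, `delete_excess` and the unit-sized segments are all hypothetical, and the sketch shows only the serialization, not the exact accounting bug from the commit.

```python
import asyncio


class SegmentManager:
    """Hypothetical stand-in for a commitlog segment manager."""

    def __init__(self, segments, limit):
        self.segments = list(segments)  # segment sizes
        self.limit = limit              # footprint limit
        self._delete_lock = asyncio.Lock()

    async def delete_excess(self):
        # Only one deletion pass may run at a time. Later callers wait here
        # and then re-evaluate the footprint from scratch.
        async with self._delete_lock:
            while sum(self.segments) > self.limit:
                await asyncio.sleep(0)  # simulated I/O: a task switch point
                self.segments.pop()


async def main():
    mgr = SegmentManager([1] * 8, limit=4)
    # Three triggers (for example replay release, timer check, flush callback)
    # fire concurrently, but the passes run one after another.
    await asyncio.gather(*(mgr.delete_excess() for _ in range(3)))
    return sum(mgr.segments)
```

Because each pass rechecks the footprint while holding the lock, concurrent triggers cannot act on a stale view of which segments to keep.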
Anna Stuchlik	cd61f60549	doc: fix product names in the 2025.1 upgrade guides This commit fixes the product names in the upgrade 2025.1 guides so that: - 6.2 is preceded with "ScyllaDB Open Source" - 2024.x is preceded with "ScyllaDB Enterprise" - 2025.1 is preceded with "ScyllaDB" Fixes https://github.com/scylladb/scylladb/issues/23154 Closes scylladb/scylladb#23223	2025-03-17 13:54:11 +03:00
Anna Stuchlik	dbbf9e19e4	doc: remove the outdated info on seeds-info This commit removes the outdated information about seed nodes. We no longer need it in the docs, as a) the documentation is versioned, and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions mentioned in the docs are no longer supported. In addition, some clarification has been added to the existing sections. Fixes https://github.com/scylladb/scylladb/issues/22400 Closes scylladb/scylladb#23282	2025-03-17 13:53:48 +03:00
Andrei Chekun	7423edb1f7	test.py: Increase verbosity of pytest Currently, pytest truncates long objects in assertions. This makes understanding the failure message difficult. This will increase verbosity and pytest will stop truncating messages. Closes scylladb/scylladb#23263	2025-03-17 12:51:41 +02:00
Aleksandra Martyniuk	20f9d7b6eb	test: add test to check concurrent tablets migration and repair Add a test to check whether a tablet can be migrated while another tablet is repaired.	2025-03-17 10:37:03 +01:00
Aleksandra Martyniuk	5b792bdc98	repair: do not hold erm for repair scheduled by scheduler Do not hold erm for tablet repair scheduled by scheduler. Thanks to that, one tablet repair won't exclude migration of other tablets. Concurrent repair and migration of the same tablet isn't possible, since a tablet can be in only one type of transition at a time. Hence the change is safe. Refs: https://github.com/scylladb/scylladb/issues/22408.	2025-03-17 10:37:02 +01:00
Aleksandra Martyniuk	a1375896df	repair: get total rf based on current erm Get total rf based on erm. Currently, it does not change anything because erm stays the same during the whole repair.	2025-03-17 10:36:18 +01:00
Aleksandra Martyniuk	34cd485553	repair: make shard_repair_task_impl::erm private Make shard_repair_task_impl::erm private. Access it with getter.	2025-03-17 10:36:14 +01:00
Andrei Chekun	a20d848c01	test.py: Refactor test/conftest.py Move functions responsible for preparation of the environment to the util file. This is extracted from https://github.com/scylladb/scylladb/pull/22894 to make it easier to work together. Closes scylladb/scylladb#23221	2025-03-17 11:31:00 +02:00
Avi Kivity	4416b0c732	treewide: use angle brackets for including seastar headers Seastar is an external library, so we use angle brackets to include its interfaces. Closes scylladb/scylladb#23301	2025-03-17 10:03:06 +02:00
Andrei Chekun	1e1d213592	test.py: Remove additional report generation for python tests Pytest is responsible for generating the report of the failed tests and there is no need to generate it one more time Closes scylladb/scylladb#23237	2025-03-17 09:36:08 +02:00
Kefu Chai	f8800b3f19	ent/encryption: rename "padd" to "padding"/"pad" and use structured bindings Replace the abbreviated term "padd" with either "padding" or "pad" throughout the encryption module. While "padd" was originally chosen to align with other variable names ("type" and "mode"), using standard terminology improves code readability and resolves codespell warnings. Additionally, refactor relevant code to use C++ structured bindings for cleaner implementation. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23251	2025-03-17 09:23:42 +02:00
Raphael S. Carvalho	e9944f0b7c	service: Introduce rack-aware co-location migrations for tablet merge Merge co-location can emit migrations across racks even when RF=#racks, reducing availability and affecting consistency of base-view pairing. Given replica set of sibling tablets T0 and T1 below: [T0: (rack1,rack3,rack2)] [T1: (rack2,rack1,rack3)] Merge will co-locate T1:rack2 into T0:rack1, T1 will temporarily be at only a subset of racks, reducing availability. This is the main problem fixed by this patch. It also lays the ground for consistent base-view replica pairing, which is rack-based. For tables on which views can be created we plan to enforce the constraint that replicas don't move across racks and that all tablets use the same set of racks (RF=#racks). This patch avoids moving replicas across racks unless it's necessary, so if the constraint is satisfied before merge, there will be no co-locating migrations across racks. This constraint of RF=#racks is not enforced yet, it requires more extensive changes. Fixes #22994. Refs #17265. This patch is based on Raphael's work done in PR #23081. The main differences are: 1) Instead of sorting replicas by rack, we try to find replicas in sibling tablets which belong to the same rack. This is similar to how we match replicas within the same host. It reduces number of across-rack migrations even if RF!=#racks, which the original patch didn't handle. Unlike the original patch, it also avoids rack overload in case RF!=#racks 2) We emit across-rack co-locating migrations if we have no other choice in order to finalize the merge This is ok, since views are not supported with tablets yet. Later, we will disallow this for tables which have views, and we will allow creating views in the first place only when no such migrations can happen (RF=#racks). 
3) Added boost unit test which checks that rack overload is avoided during merge in case RF<#racks 4) Moved logging of across-rack migration to debug level 5) Exposed metric for across-rack co-locating migrations Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com> Closes scylladb/scylladb#23247	2025-03-16 22:45:00 +02:00
Pavel Emelyanov	95809a3ed1	Update seastar submodule * seastar 5b95d1d7...412d058c (62): > fstream: Export functions for making file_data_source > build: Include DPDK dependency libraries in Seastar linkage > demos/tls_echo_server_demo: Modernize with seastar::async > http/client: Pass abort source by pointer > rpc: remove deprecated logging function support > github: Add Alpine Linux workflow to test builds with musl libc > exception_hacks: Make dl_iterate_phdr resolution manual > tests: relax test_file_system_space check for empty filesystems > demos/udp_server_demo: Modernize with seastar::async and proper teardown > future: remove deprecated functions/concepts > util: logger: remove deprecated set_stdout_enabled and logger_ostream_type::{stdout,stderr} > memory: guard __GLIBC_PREREQ usage with __GLIBC__ check > scheduling_specific: Add noexcept wrapper for free() > file: Replace __gid_t with standard POSIX gid_t > aio_storage_context: Use reactor::do_at_exit() > json2code: support chunked_fifo > json: remove unused headers > httpd: test cases for streaming > build: use find_dependency() instead find_package() in config file > build: stop using a loop for finding dependencies > dns: Fix event processing to work safely with recent c-ares > tutorial: add a section about initialization and cleanup > reactor: deprecate at_exit() > httpclient: Add exception handling to connection::close > file: document max_length-limits for dma_read/write funcs taking vector<iovec> > build: fix P2582R1 detection in GCC compatibility check > json2code: optimize string handling using std::string_view > tests/unit: fix typo in test output > doc: Update documentation after removing build.sh > test: Add direct exception passing for awaits for perf test > github: add Docker build verification workflow > docker: update LLVM debian repo for Ubuntu Orcular migration > tests/unit: Use http.HTTPStatus constants instead of raw status codes > tests/unit: Fix exception verification in 
json2code_test.py > httpd: handle streaming results in more handlers > json: stream_object now moves value > json: support for rvalue ranges > chunked_fifo: make copyable > reactor: deprecate at_destroy() > testing: prevent test scheduling after reactor exit > net: Add bytes sent/received metrics > net: switch rss_key_type to std::span instead of std::string_view > log: fixes for libc++ 19 > sstring: fixes for lib++ 19 > build: finalize numactl dependency removal > build: link DPDK against libnuma when detected during build > memory: remove libnuma dependency > treewide: replace assert with SEASTAR_ASSERT > future: fix typo in comment > http: Unwrap nested exceptions to handle retryable transport errors > net/ip, net: sed -i 's/to_ulong/to_uint/' > core: function_traits noexcept specializations > util/variant: seastar::visit forward value arg > net/tls: fix missing include > tls: Add a way to inspect peer certificate chain > websocket: Extract encode_base64() function > websocket: Rename wlogger to websocket_logger > websocket: Extract parts of server_connection usable for client > websocket: Rename connection to server_connection > websocket: Extract websocket parser to separate file > json2code_test: factor out query method > seastar-json2code: fix error handling Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23281	2025-03-16 21:57:43 +02:00
Benny Halevy	41f02c521d	main: allow abort during join_cluster Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:21:15 +02:00
Benny Halevy	f269480f53	main: add checkpoint before joining cluster Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:08:04 +02:00
Benny Halevy	0fc196991a	storage_service: add start_sys_dist_ks Currently, there's a call to `supervisor::notify("starting system distributed keyspace")` which is misleading as it is identical to a similar message in main() when starting the sharded service. Change that to a storage_service log message and be more specific that the sys_dist_ks shards are started. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:05:23 +02:00
Jenkins Promoter	d84da3dc11	Update pgo profiles - x86_64	2025-03-15 04:57:28 +02:00
Jenkins Promoter	6e8e2ae333	Update pgo profiles - aarch64	2025-03-15 04:48:49 +02:00
Pavel Emelyanov	604fdd86e9	test: Count mutation fragments verbosely in scoped restore test Sometimes after scoped restore a key is not found in nodes' mutation fragments. This patch makes the counting more verbose to get a better understanding of what's going on in case of test failure refs: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23296	2025-03-14 21:31:36 +02:00
Pavel Emelyanov	bfbe802632	streaming: Relax load_sstable_for_tablet() The method does several excessive things, that can be relaxed 1. In order to transfer a table-id to another shard, finds the table on source shard, gets schema and captures schema id on invoke_on()'s lambda. It can just capture the original table-id 2. In order to get sstable parameters (format, version, etc.) generates toc_filename(), then calls parse_path() to convert it into the entry_descriptor. The descriptor can be read from sstable directly. 3. Logging "success" includes target shard into the message, but happens on the source shard. The message can be just logged on target shard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23197	2025-03-14 15:26:48 +02:00
Botond Dénes	39bcf99f8e	Merge 'Apply hard limit to partition range vectors in secondary index queries' from Nikos Dragazis Secondary index queries fetch partition keys from the index view and store them in an `std::vector`. The vector size is currently limited by the user's page size and the page memory limit (1MiB). These are not enough to prevent large contiguous allocations (which can lead to stalls). This series introduces a hard limit to the vector size to ensure it does not exceed the allocator's preferred max contiguous allocation size (128KiB). With the size of each element being 120 bytes, this allows for 1092 partition keys. The limit was set to 1000. Any partitions above this limit are discarded. Discarding partitions breaks the querier cache on the replicas, causing a performance regression, as can be seen from the following measurements: ``` * Cluster: 3 nodes (local Docker containers), 1 vCPU, 4GB memory, dev mode * Schema: CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.t1 (pk1 int, pk2 int, ck int, value int, PRIMARY KEY ((pk1, pk2), ck)); CREATE INDEX t1_pk2_idx ON ks.t1(pk2); * Query: CONSISTENCY LOCAL_QUORUM; SELECT * FROM ks.t1 where pk2 = 1; +------------+-------------------+-------------------+ \| Page Size \| Master \| Vector Limit \| +============+===================+===================+ \| \| Latency (sec) \| Latency (sec) \| +------------+-------------------+-------------------+ \| 100 \| 5.80 ± 0.13 \| 5.64 ± 0.10 \| +------------+-------------------+-------------------+ \| 1000 \| 4.77 ± 0.07 \| 4.62 ± 0.06 \| +------------+-------------------+-------------------+ \| 2000 \| 4.67 ± 0.07 \| 5.13 ± 0.03 \| +------------+-------------------+-------------------+ \| 5000 \| 4.82 ± 0.09 \| 6.25 ± 0.06 \| +------------+-------------------+-------------------+ \| 10000 \| 4.89 ± 0.36 \| 7.52 ± 
0.13 \| +------------+-------------------+-------------------+ \| -1 \| 4.90 ± 0.67 \| 4.79 ± 0.33 \| +------------+-------------------+-------------------+ ``` We expect this to be fixed with adaptive paging in a future PR. Until then, users can avoid regressions by adjusting their page size. Additionally, this series changes the `untyped_result_set` to store rows in a `chunked_vector` instead of an `std::vector`, similarly to the `result_set`. Secondary index queries use an `untyped_result_set` to store the raw result from the index view before processing. With 1MiB results, the `std::vector` would cause a large allocation of this magnitude. Finally, a unit test is added to reproduce the bug. Fixes #18536. The PR fixes stalls of up to 100ms, but there is an easy workaround: adjust the page size. No need to backport. Closes scylladb/scylladb#22682 * github.com:scylladb/scylladb: cql3: secondary index: Limit page size for single-row partitions cql3: secondary index: Limit the size of partition range vectors cql3: untyped_result_set: Store rows in chunked_vector test: Reproduce bug with large allocations from secondary index	2025-03-14 15:06:07 +02:00
Botond Dénes	83ea1877ab	Merge 'scylla-sstable: add native S3 support' from Ernest Zaslavsky scylla-sstable: Enable support for S3-stored sstables Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532) This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst ref: https://github.com/scylladb/scylladb/issues/20532 fixes: https://github.com/scylladb/scylladb/issues/20535 No backport needed since the S3 functionality was never released Closes scylladb/scylladb#22321 * github.com:scylladb/scylladb: tests: Add Tests for Scylla-SSTable S3 Functionality docs: Update Scylla Tools Documentation for S3 SSTable Support scylla-sstable: Enable Support for S3 SSTables s3: Implement S3 Fully Qualified Name Manipulation Functions object_storage: Refactor `object_storage.yaml` parsing logic	2025-03-14 15:05:52 +02:00
Patryk Jędrzejczak	ca5c223505	test: mark tests with the gossip-based recovery procedure This patch makes it clear which Raft recovery procedure is used in each test. Tests with "This test uses the gossip-based recovery procedure." are the tests that use the gossip-based topology. These tests should be deleted once we make the Raft-based topology mandatory. Tests with the new FIXME are the tests that use the Raft-based topology. They should be changed to use the Raft-based recovery procedure or removed if they don't test anything important with the new procedure.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	4fd0e93154	test: add tests for the Raft-based recovery procedure	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	4e055882c1	test: topology: util: fix the tokens consistency check for left nodes When we remove a node in the Raft-based topology (by remove/replace/decommission), we remove its tokens from `system.topology`, but we do not change `num_tokens`. Hence, the old check could fail for left nodes.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	d0efc77d20	test: topology: util: extend start_writes We extend `start_writes` to allow: - providing `ks_name` from the test, - restarting it (by starting it again with the same `ks_name`), - running it in the presence of shutdowns. We use these features in a new test in one of the following patches.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	9970c1fcc3	gossip: allow group 0 ID mismatch in the Raft-based recovery procedure This patch ensures that members of the new group 0 can gossip with members of the old group 0 during rolling restart in the Raft-based recovery procedure. Without this change, restarted nodes (members of the new group 0) wouldn't be marked as UP by other nodes (members of the old group 0), which would decrease availability.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	3b9765dac8	raft_group0: modify_raft_voter_status: do not add new members In the new Raft-based recovery procedure, we create a new group 0. Dead nodes are not members of this group 0. Also, the removenode handler makes a node being removed a non-voter. So, with the previous implementation of `modify_raft_voter_status`, the node being removed would become a non-voting member of the new group 0, which is very weird. It should not cause problems, but we better avoid it and keep the procedure clean. This change also makes `modify_raft_voter_status` more intuitive in general.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	fd51d7e448	treewide: allow recreating group 0 in the Raft-based recovery procedure This patch adds support for recreating group 0 after losing majority. This is the only part of the new Raft-based recovery procedure that touches Scylla core. The following steps are necessary to recreate group 0: 1. Determine the new group 0 members. These are alive nodes that are normal or rebuilding. 2. Choose the recovery leader - the node which will become the new group 0 leader. This must be one of the nodes with the latest persistent group 0 state. 3. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 4. Set the new scylla.yaml parameter - `recovery_leader` - to Host ID of the recovery leader on each live node. 5. Rolling restart all live nodes, but the recovery leader must be restarted first. In the implementation, restarts in step 5 are very similar to normal restarts with the Raft-based topology enabled. The only differences are: 1. Steps 3-4 make the restarting node discover the new group 0 in `join_cluster`. 2. The group 0 server is started in `join_group0`, not `setup_group0_if_exists`. 3. The restarting node joins the new group 0 in `join_topology` using `legacy_handshaker`. There is no reason to contact the topology coordinator since the node has already joined the topology. Unfortunately, this patch creates another execution path for the starting logic. `join_cluster` becomes even messier. However, there is nothing we can do about it. Joining group 0 without joining topology is something completely new. Having a few small changes without touching other execution paths is the best we can do. We will start removing the old stuff soon, after making the Raft-based topology mandatory, and the situation will improve.	2025-03-14 13:52:57 +01:00
Nadav Har'El	de7c1d526a	test/cqlpy: test DESC doesn't list an index as a view Issue #6058 complained that "DESCRIBE TABLE" or "DESCRIBE KEYSPACE" list a secondary index as materialized view (the view used to back the index in Scylla's implementation of secondary indexes). This patch adds a test to verify that this issue no longer exists in server-side describe - so we can mark the issue as fixed. While preparing this test, I noticed that Scylla and Cassandra behave differently on whether DESC TABLE should list materialized views or not, so this patch also includes a test for that as well - and I opened issue #23014 on Scylla and CASSANDRA-20365 on Cassandra to further discuss that new issue. Fixes #6058 Refs #23014. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23015	2025-03-14 14:40:19 +03:00
Nadav Har'El	c0821842de	alternator: document the state of tablet support in Alternator In commit `c24bc3b` we decided that creating a new table in Alternator will by default use vnodes - not tablets - because of all the missing features in our tablets implementation that are important for Alternator, namely - LWT, CDC and Alternator TTL. We never documented this, or the fact that we support a tag `experimental:initial_tablets` which allows overriding this decision and creating an Alternator table using tablets. We also never documented what exactly doesn't work when Alternator uses tablets. This patch adds the missing documentation in docs/alternator/new-apis.md (which is a good place for describing the `experimental:initial_tablets` tag). The patch also adds a new test file, test_tablets.py, which includes tests for all the statements made in the document regarding how `experimental:initial_tablets` works and what works or doesn't work when tablets are enabled. Two existing tests - for TTL and Streams non-support with tablets - are moved to the new test file. When the tablets feature is finally completed, both the document and the tests will need to be modified (some of the tests should be outright deleted). But it seems this will not happen for at least several months, and that is too long to wait without accurate documentation. Fixes #21629 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22462	2025-03-14 14:03:15 +03:00
Pavel Emelyanov	2bb455ec75	Merge 'Main: stop system_keyspace' from Benny Halevy This series adds an async guard to system_keyspace operations and adds a deferred action to stop the system_keyspace in main() before destroying the service. This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped. * Enhancement, no backport needed Closes scylladb/scylladb#23113 * github.com:scylladb/scylladb: main: stop system keyspace system_keyspace: call shutdown from stop system_keyspace: shutdown: allow calling more than once database, compaction_manager, large_data_handler: use pluggable<system_keyspace> utils: add class pluggable	2025-03-14 13:23:28 +03:00
Aleksandra Martyniuk	444c7eab90	repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary When small_table_optimization isn't enabled, put_row_diff_with_rpc_stream does not access erm. Pass small_table_optimization_params containing erm only when small_table_optimization is enabled. This is safe as erm is kept by shard_repair_task_impl.	2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk	e56bb5b6e2	repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary When small_table_optimization isn't enabled, flush_rows_in_working_row_buf does not access erm. Add small_table_optimization_params containing erm and pass it only when small_table_optimization is enabled. This is safe as erm is kept by shard_repair_task_impl.	2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk	09c74aa294	repair: pass session_id to repair_writer_impl::create_writer	2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk	47bb9dcf78	repair: keep materialized topology guard in shard_repair_task_impl Keep materialized topology guard in shard_repair_task_impl and check it in check_in_abort_or_shutdown and before each range repair.	2025-03-14 10:41:10 +01:00
Aleksandra Martyniuk	928f92c780	repair: pass session_id to repair_meta Pass session_id of tablet repair down the stack from the repair request to repair_meta. The session_id will be utilized in the following patches.	2025-03-14 10:20:12 +01:00
Nadav Har'El	a72dde2ee6	test/cqlpy: add test for long table names Scylla inherited a 48-character limit on the length of table (and keyspace) names from Cassandra 3. It turns out that Cassandra 4 and 5 unintentionally dropped this limit (see history lesson in CASSANDRA-20425), and now Cassandra accepts longer table names. Some Cassandra users are using such longer names and are disappointed that Scylla doesn't allow them. This patch includes tests for this feature. One test tries a 48-character table name - it passes on Scylla and all versions of Cassandra. A second test tries a 100-character table name - this one passes on Cassandra version 4 and above (but not on 3), and fails on Scylla, so it is marked "xfail". A third test tries a 500-character table name. This one fails badly on Cassandra (see CASSANDRA-20389), but passes on Scylla today. This test is important because we need to be sure that it continues to pass on Scylla even after Scylla is fixed to allow the 100-character test. Refs #4480 - an issue we already have about supporting longer names Note on the test implementation: Ideally, the test for a particular table-name length shouldn't just create the table - it should also make sure we can write data to it and flush it, i.e., that sstables can get written correctly. But in practice, these complications are not needed, because in modern Scylla it is the directory name which contains the table's name, and the individual sstable files do not contain the table's name. Just creating the table already creates the long directory name, so that is the part that needs to be tested. If we created this directory successfully, later creating the short-named sstables inside it can't fail. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23229	2025-03-14 11:15:07 +03:00
Kefu Chai	a82cfbecad	test: perf_sstable: close frag_stream before destroying it the underlying reader should be closed before being destroyed. otherwise we'd have the following failure when testing the "full_scan_streaming": ``` $ scylla perf-sstable --parallelism 1 --iterations 20 --partitions 20 --testdir /tmp/sstable --mode full_scan_streaming ERROR 2025-03-13 15:04:26,321 [shard 0:main] mutation_reader - N8sstables2mx27mx_sstable_full_scan_readerE [0x60015a36b650]: permit .:test: was not closed before destruction, at: 0x235931e 0x2359470 0x239deb3 0x62a1ed3 0x89fd156 0x89c3fba 0x22a6ed3 0x22a8fea 0x22aae17 0x22a9928 0x26bb7d0 0x26bbe3e 0x89bca67 0x246bd8d /lib64/libc.so.6+0x3247 /lib64/libc.so.6+0x330a 0x1657774 ------ seastar::internal::coroutine_traits_base<double>::promise_type ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23270	2025-03-14 11:12:44 +03:00
Piotr Smaron	d365d9b2ad	test/ldap: assign non-busy ports to ldap It may happen that the ports we randomly choose for LDAP are busy, and that'd fail the test suite, so once we randomly select ports, now we'll see if they're busy or not, and if they're busy, we'll select next ones, until we finally have some free ports for LDAP. Tested with: `./test.py ldap/ldap_connection_test --repeat 1000 -j 10`: before the fix, this command fails after ~112 runs, and of course it passes with the fix. Fixes: scylladb/scylla-enterprise#5120 Fixes: scylladb/scylladb#23149 Fixes: scylladb/scylladb#23242 Closes scylladb/scylladb#23275	2025-03-14 11:09:19 +03:00
Botond Dénes	68b2ac541c	Merge 'streaming: fix the way a reason of streaming failure is determined' from Aleksandra Martyniuk During streaming, the receiving node gets and processes mutation fragments. If this operation fails, the receiver responds with -1 status code, unless it failed due to no_such_column_family in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on the receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace the try_catch clause with table_sync_and_check, which synchronizes the schema and checks whether the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834. Needs backport to all live versions, as they all contain the bug Closes scylladb/scylladb#22868 * github.com:scylladb/scylladb: streaming: fix the way a reason of streaming failure is determined streaming: save a continuation lambda streaming: use streaming namespace in table_check.{cc,hh} repair: streaming: move table_check.{cc,hh} to streaming	2025-03-14 07:25:00 +02:00
Kefu Chai	31320399e8	test: sstable_test: use `auto` instead of `statistics` to avoid name collision Replace explicit `statistics` type with `auto` in sstable_test to resolve name collision. This addresses ambiguity introduced by commit 87c221cb which added `struct statistics` in `seastar/include/seastar/net/api.hh`, conflicting with the existing definition in `scylladb/sstables/types.hh` when the `seastar` namespace is opened. The `auto` keyword avoids the need to explicitly reference either type, cleanly resolving the collision while maintaining functionality. This change prepares for the upcoming change to bump up seastar submodule. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23249	2025-03-13 22:51:21 +02:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This series starts the conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. One of which is dropping code related to old shadow round. We replaced shadow round with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new node can co-exist in the same cluster, " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused 
template ...	2025-03-13 13:36:31 +02:00
Kefu Chai	5eba29e376	ent/encryption: correct misspellings these misspellings were flagged by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23254	2025-03-13 13:07:34 +02:00
Kefu Chai	9f411f9962	tools/scylla-nodetool: refactor to use std::tie() for cleaner code Replace explicit pair member access with std::tie() throughout scylla-nodetool. This simplifies the code by eliminating repetitive pair.first/pair.second references and makes the codebase more maintainable and readable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23250	2025-03-13 11:56:07 +02:00
Dawid Mędrek	0a6137218a	db/hints: Cancel draining when stopping node Draining hints may occur in one of the two scenarios: * a node leaves the cluster and the local node drains all of the hints saved for that node, * the local node is being decommissioned. Draining may take some time and the hint manager won't stop until it finishes. It's not a problem when decommissioning a node, especially because we want the cluster to retain the data stored in the hints. However, it may become a problem when the local node started draining hints saved for another node and now it's being shut down. There are two reasons for that: * Generally, in situations like that, we'd like to be able to shut down nodes as fast as possible. The data stored in the hints won't disappear from the cluster yet since we can restart the local node. * Draining hints may introduce flakiness in tests. Replaying hints doesn't have the highest priority and it's reflected in the scheduling groups we use as well as the explicitly enforced throughput. If there are a large number of hints to be replayed, it might affect our tests. It's already happened, see: scylladb/scylladb#21949. To solve those problems, we change the semantics of draining. It will behave as before when the local node is being decommissioned. However, when the local node is only being stopped, we will immediately cancel all ongoing draining processes and stop the hint manager. To compensate for that, when we start a node and it initializes a hint endpoint manager corresponding to a node that's already left the cluster, we will begin the draining process of that endpoint manager right away. That should ensure all data is retained, while possibly speeding up the shutdown process. There's a small trade-off to it, though. If we stop a node, we can then remove it. It won't have a chance to replay hints it might've replayed before these changes, but that's an edge case. We expect this commit to bring more benefit than harm. 
We also provide tests verifying that the implementation works as intended. Fixes scylladb/scylladb#21949 Closes scylladb/scylladb#22811	2025-03-13 11:55:15 +02:00
Paweł Zakrzewski	d483051e44	cql3/select_statement: reject aggregate functions when PER PARTITION LIMIT is present Before this patch we silently allowed and ignored PER PARTITION LIMIT. While using aggregate functions in conjunction with PER PARTITION LIMIT can make sense, we want to disable it until we can offer proper implementation, see #9879 for discussion. We want to match Cassandra, and for queries with aggregate functions it behaves as follows: - it silently ignores PER PARTITION LIMIT if GROUP BY is present, which matches our previous implementation. - rejects PER PARTITION LIMIT when GROUP BY is not present. This patch adds rejection of the second group. Fixes #9879 Closes scylladb/scylladb#23086	2025-03-13 10:29:53 +02:00
Pavel Emelyanov	f50bcbf4d0	test/perf/s3: Don't forget to stop sharded<tester> on error In case invoke_on_all(tester::start) throws, the sharded<tester> instance remains non-stopped and calltrace is reported on test stop. Not nice, fix it so that sharded<> thing is stopped in any case. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23244	2025-03-13 09:54:09 +02:00
Anna Stuchlik	562b5db5b8	doc: Remove "experimental" from ALTER KEYSPACE with Tablets Altering a keyspace with tablets is no longer experimental. This commit removes the "Experimental" label from the feature. Fixes https://github.com/scylladb/scylladb/issues/23166 Closes scylladb/scylladb#23183	2025-03-12 17:41:36 +02:00
Kefu Chai	68fc067106	perf/perf_sstable: fix the indent Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-12 19:00:50 +08:00
Kefu Chai	4f62f79622	perf/perf_sstable: stop using at_exit() seastar::at_exit() was marked deprecated recently. so let's use the recommended approach to perform cleanups. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-12 19:00:50 +08:00
Nadav Har'El	3ca2e6ddda	Merge 's3_client: Add retries to Security Token Service/EC2 instance metadata credentials providers' from Ernest Zaslavsky Several updates and improvements to the retryable HTTP client functionality, as well as enhancements to error handling and integration with AWS services, as part of this PR. Below is a summary of the changes: - Moved the retryable HTTP client functionality out of the S3 client to improve modularity and reusability across other services like AWS STS. - Isolated the retryable_http_client into its own file, improving clarity and maintainability. - Added a make_request method that introduces a response-skipping handler. - Introduced a custom error handler constructor, providing greater flexibility in handling errors. - Updated the STS and Instance Metadata Service credentials providers to utilize the new retryable HTTP client, enhancing their robustness and reliability. - Extended the AWS error list to handle errors specific to the STS service, ensuring more granular and accurate error management for STS operations. - Enhanced error handling for system errors returned by Seastar’s HTTP client, ensuring smoother operations. - Properly closed the HTTP client in instance_profile_credentials_provider and sts_assume_role_credentials_provider to prevent resource leaks. - Reduced the log severity in the retry strategy to avoid SCT test failures that occur when any log message is tagged as an ERROR. 
No backport needed since no S3-related functionality has been released on the Scylla side yet. Closes scylladb/scylladb#21933 * github.com:scylladb/scylladb: s3_client: Adjust Log Severity in Retry Strategy aws_error: Enhance error handling for AWS HTTP client aws_error: Add STS specific error handling credentials_providers: Close retryable clients in Credentials Providers credentials_providers: Integrate retryable_http_client with Credentials Providers s3_client: enhance `retryable_http_client` functionality s3_client: isolate `retryable_http_client` s3_client: Prepare for `retryable_http_client` relocation s3_client: Remove `is_redirect_status` function s3_client: Move retryable functionality out of s3 client	2025-03-12 10:19:15 +02:00
Avi Kivity	b1d9f80d85	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer needs to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant setback because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to the above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. 
Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats	2025-03-11 14:34:27 +02:00
Gleb Natapov	57f2b6d825	gossiper: drop unneeded code host_id is already available at this point.	2025-03-11 12:09:22 +02:00
Gleb Natapov	cca228265e	gossiper: move _expire_time_endpoint_map to host_id Index _expire_time_endpoint_map map by host id instead of ip	2025-03-11 12:09:22 +02:00
Gleb Natapov	c45b50bbe6	gossiper: move _just_removed_endpoints to host id Index _just_removed_endpoints map by host id instead of ip	2025-03-11 12:09:22 +02:00
Gleb Natapov	22739bb39a	gossiper: drop unused get_msg_addr function	2025-03-11 12:09:22 +02:00
Gleb Natapov	b3720b80b6	messaging_service: change connection dropping notification to pass host id only Only host id is needed in the callback anyway.	2025-03-11 12:09:22 +02:00
Gleb Natapov	24d30073f9	messaging_service: pass host id to remove_rpc_client in down notification Do not iterate over all clients indexed by host id to search for those with a given IP. Look up by host id directly since now we know it in down notification. In case the host id is not known, look it up by ip.	2025-03-11 12:09:22 +02:00
Gleb Natapov	4ca627b533	treewide: pass host id to endpoint_lifecycle_subscriber	2025-03-11 12:09:22 +02:00
Gleb Natapov	8a747fbc2a	treewide: drop endpoint life cycle subscribers that do nothing Provide default implementation for them instead. Will be easier to rework them later.	2025-03-11 12:09:22 +02:00
Gleb Natapov	525b88f877	load_meter: move to host id Use host id indexing in load_meter and only convert to ips on api level.	2025-03-11 12:09:22 +02:00
Gleb Natapov	48a1030c91	treewide: use host id directly in endpoint state change subscribers Now that we have host ids in endpoint state change subscribers some of them can be simplified by using the id directly instead of looking it up by ip.	2025-03-11 12:09:22 +02:00
Gleb Natapov	499eb4d17f	treewide: pass host id to endpoint state change subscribers	2025-03-11 12:09:22 +02:00
Gleb Natapov	eb59205caf	gossiper: drop deprecated unsafe_assassinate_endpoint operation It was always deprecated.	2025-03-11 12:09:21 +02:00
Gleb Natapov	c17a8b4a76	storage_service: drop unused code in handle_state_removed	2025-03-11 12:09:21 +02:00
Gleb Natapov	696aee3adc	treewide: drop endpoint state change subscribers that do nothing Provide default implementation for them instead. Will be easier to rework them later.	2025-03-11 12:09:21 +02:00
Gleb Natapov	7dcffda6bd	gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory	2025-03-11 12:09:21 +02:00
Gleb Natapov	8425c26462	gossiper: start using host ids to send messages earlier Send digest ack and ack2 by host ids as well now since the id->ip mapping is available after receiving digest syn. It allows to convert more code to host id here.	2025-03-11 12:09:21 +02:00
Gleb Natapov	f0af3f261e	messaging_service: add temporary address map entry on incoming connection We want to move to use host ids as soon as possible. Currently it is possible only after the full gossiper exchange (because only at this point gossiper state is added and with it address map entry). To make it possible to move to host ids earlier this patch adds address map entries on incoming communication during CLIENT_ID verb processing. The patch also adds generation to CLIENT_ID to use it when address map is updated. It is done so that older gossiper entries can be overwritten with newer mapping in case of IP change.	2025-03-11 12:09:21 +02:00
Gleb Natapov	c3035caeb5	topology_coordinator: notify about IP change from sync_raft_topology_nodes as well Currently sync_raft_topology_nodes() only sends a join notification if a node is new in the topology, but sometimes a node changes IP and the join notification should be sent for the new IP as well. Usually it is done from ip_address_updater, but topology reload can run first and then the notification will be missed. The solution is to send the notification during topology reload as well.	2025-03-11 12:09:21 +02:00
Gleb Natapov	0e3dcb7954	treewide: move everyone to use host id based gossiper::is_alive and drop ip based one	2025-03-11 12:09:21 +02:00
Gleb Natapov	56c6e04079	storage_proxy: drop unused template The storage_proxy::is_alive is called with host_id only.	2025-03-11 12:09:21 +02:00
Gleb Natapov	e47f251178	gossiper: move _live_endpoints and _unreachable_endpoints endpoint to host_id Index live and dead endpoints by host id. It also allows to simplify some code that does a translation.	2025-03-11 12:09:21 +02:00
Gleb Natapov	6f05608b5e	gossiper: chunk vector using std::views::chunk instead of coding it explicitly	2025-03-11 12:09:21 +02:00
Gleb Natapov	0437f558cd	idl: generate ip based version of a verb only for verbs that need it The patch adds a new marker for a verb, [[ip]], which means that an ip version of this verb needs to be generated. Most of the verbs do not need it.	2025-03-11 12:09:21 +02:00
Gleb Natapov	3734afe8a5	gossiper: send shutdown notification by host id	2025-03-11 12:09:21 +02:00
Gleb Natapov	ee59baf6fc	gossiper: drop old shadow round code It is no longer used. It was replaced with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0	2025-03-11 12:09:20 +02:00
Gleb Natapov	f1a82c1d01	gossiper: drop unused get_endpoint_states function	2025-03-11 12:09:20 +02:00
Gleb Natapov	c4a0fbae16	gossiper: check id match inside force_remove_endpoint Before calling force_remove_endpoint (which works on ip) the code checks that the ip maps to the correct id (so as not to remove, by mistake, a new node that inherited this ip). Move the check to the function itself.	2025-03-11 12:09:20 +02:00
Gleb Natapov	52c9217f1b	migration_manager: drop unneeded id to ip translation	2025-03-11 12:09:20 +02:00
Gleb Natapov	4420ddaf86	gossiper: move is_gossip_only_member and its users to work on host id	2025-03-11 12:09:20 +02:00
Gleb Natapov	cb2b874942	table: use host id based get_endpoint_state_ptr and skip id->ip translation	2025-03-11 12:09:20 +02:00
Gleb Natapov	2746d391af	gossiper: do not ping outdated address A node may change its IP but some other node in the cluster may still try to ping it using an old IP because it may receive an outdated gossiper entry with the old IP. Do not send an echo message to the old IP. It would cause a misleading UP message with the old address to be printed.	2025-03-11 12:09:20 +02:00
Gleb Natapov	aaba55073d	storage_service: drop outdated code that checks whether raft topology should be used After raft_topology_change_enabled() was introduced the code does nothing useful. The function is responsible for the decision if raft topology is enabled or not.	2025-03-11 12:09:20 +02:00
Gleb Natapov	6952f62869	gossiper: drop unused field from loaded_endpoint_state	2025-03-11 12:09:20 +02:00
Avi Kivity	f18e8edcb7	Merge 'dist/docker: switch to UBI9' from Takuya ASADA Switch container base image to UBI9, and make it ready for Red Hat OpenShift Certification. Fixes https://github.com/scylladb/scylla-pkg/issues/4858 Closes scylladb/scylladb#22910 * github.com:scylladb/scylladb: dist/docker: run the container as non-root user dist/docker: switch to UBI9	2025-03-10 15:33:30 +02:00
Luis Freitas	09e790d5af	.github: Update github action for triggering next gating Previously we used a marketplace GitHub action which had some limitations. This pull request updates the GitHub action to use curl, which gives us full control of the flow instead of relying on a pre-made GitHub action. Fixes: scylladb#23088 Closes scylladb/scylladb#23215	2025-03-10 14:38:08 +02:00
Nikos Dragazis	7a6a4f54a5	cql3: secondary index: Limit page size for single-row partitions The size of the partition range vector was constrained in the previous patch. Any rows beyond the vector's capacity are discarded. In the special case of single-row partitions, we know the size of each partition, so we can enforce this limit on the query itself via the page size. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-10 12:18:49 +02:00
Nikos Dragazis	76b31a3acc	cql3: secondary index: Limit the size of partition range vectors The partition range vector is an std::vector, which means it performs contiguous allocations. Large allocations are known to cause problems (e.g., reactor stalls). For paged queries, limit the vector size to 1000. If more partition keys are available in the query result, discard them. Ideally, we should not be fetching them at all, but this is not possible without knowing the size of each partition. Currently, each vector element is 120 bytes and the standard allocator's max preferred contiguous allocation is 128KiB. Therefore, the chosen value of 1000 satisfies the constraint (128 KiB / 120 = 1092 > 1000). This should be good enough for most cases. Since secondary index queries involve one base table query per partition key, these queries are slow. A higher limit would only make them slower and increase the probability of a timeout. For the same reason, saving a follow-up paged request from the client would not increase the efficiency much. For unpaged queries, do not apply any limit. This means they remain susceptible to stalls, but unpaged queries are considered unoptimized anyway. Finally, update the unit test reproducer since the bug is now fixed. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-10 12:18:42 +02:00
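The sizing argument in the commit above can be checked with a few lines of arithmetic. This is a Python sketch rather than ScyllaDB code: the constant and function names are hypothetical, and the 120-byte element size, the 128 KiB allocation threshold, and the limit of 1000 are the figures quoted in the commit message.

```python
# Figures quoted in the commit message; the names are illustrative only.
ELEMENT_SIZE_BYTES = 120
MAX_CONTIGUOUS_ALLOC_BYTES = 128 * 1024
MAX_PARTITION_RANGES = 1000


def max_elements_within_limit(element_size, alloc_limit):
    """Largest element count whose contiguous storage fits in alloc_limit."""
    return alloc_limit // element_size


def truncate_partition_ranges(ranges, limit=MAX_PARTITION_RANGES):
    """Keep at most `limit` entries and discard the rest (paged queries only)."""
    return ranges[:limit]
```

With these figures, `max_elements_within_limit(ELEMENT_SIZE_BYTES, MAX_CONTIGUOUS_ALLOC_BYTES)` evaluates to 1092, so a vector capped at 1000 elements stays below the quoted threshold.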
Pavel Emelyanov	db70c7bbf7	api: Remove the remaining parse_tables() overload There's only one caller of it left -- the scrub handler. It can use the parse_table_infos() one and get table names from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:14:10 +03:00
Pavel Emelyanov	89f3c1a91e	database: Sanitize flush_tables_on_all_shards() Previous patch left this method with a few uglinesses: - the vector<table_id> argument is named table_names - the sstring keyspace argument is unused - the keyspace argument is captured for no use Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:13:10 +03:00
Pavel Emelyanov	0f9cc956f4	schema_tables: Remove all_table_names() Now it's unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:12:56 +03:00
Pavel Emelyanov	c2d23d7948	database: Make tables flushing helper use table_info-s, not names The database::flush_tables_on_all_shards() method accepts a keyspace name and a vector of table names. Then it converts ks:cf pair for each of the table name into a table-id and flushes the table with the ID. All the callers of that method already have or can easily get the vector of table_id-s, not just names, so make use of this. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:11:32 +03:00
Pavel Emelyanov	e94dce1725	api: Make keyspace flush endpoint use parse_table_infos() (and a bit more) Currently the handler in question calls parse_tables() which returns an empty list of tables if the "cf" parameter is missing, or the table names if it's present. In the former case the handler will call flush_keyspace_on_all_shards() that just gets all table names from the keyspace and flushes them all. This change makes the handler use parse_table_infos() which is different -- when the "cf" parameter is missing, it gets all tables from the keyspace. So the handler no longer needs to call the keyspace flush, it can always call the "flush the list of tables" helper. With that change one of the parse_tables() helpers becomes unused, so remove it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:06:55 +03:00
Pavel Emelyanov	5a897d7368	schema_tables,client_state: Switch to using all_table_infos() There are a few more places left that can use all_table_infos() as a replacement for all_table_names(), patch them. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:05:59 +03:00
Pavel Emelyanov	da05765746	schema_tables: Tune up some methods to benefit from table_infos There are convert_schema_to_mutations() and calculate_schema_digest() that collect table names and then use them to find schema and query mutations from the table. Both can use the newly introduced all_table_infos() and use the returned table_id-s to do the same, thus avoiding re-lookups (which are fast anyway, but still). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:01:50 +03:00
Pavel Emelyanov	d7bfa5a545	schema_tables: Introduce all_table_infos() This method is like all_table_names(), but returns a vector of table_info-s which is effectively a pair of string name and uuid id. To be used later, and the string-returning all_table_names() will be removed very soon too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 12:59:03 +03:00
Ernest Zaslavsky	c8de7619e5	s3_client: Adjust Log Severity in Retry Strategy * Reduced log severity in retry_strategy. * Rationale: SCT fails tests when any message is logged as ERROR.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	8e46929474	aws_error: Enhance error handling for AWS HTTP client - Seastar's HTTP client is known to throw exceptions for various reasons, including network errors, TLS errors and other transient issues. - Update error handling to correctly capture and process all exceptions from Seastar's HTTP client. - Previously, only aws_exception was handled, causing retryable errors to be missed and `should_retry` not invoked. - Now, all exceptions trigger the appropriate retry logic per the intended strategy. - Add tests for the S3 proxy to ensure robustness and reliability of these enhancements.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	92a12c96a2	aws_error: Add STS specific error handling Updated the AWS error list to include handling for errors specific to the STS service. This enhancement ensures more comprehensive error management for STS-related operations.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	a371d6cf62	credentials_providers: Close retryable clients in Credentials Providers Updated `instance_profile_credentials_provider` and `sts_assume_role_credentials_provider` to close the HTTP client appropriately.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	45a6e88954	credentials_providers: Integrate retryable_http_client with Credentials Providers * Updated STS and Instance Metadata Service credentials providers to utilize retryable_http_client.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	7c49ee4520	s3_client: enhance `retryable_http_client` functionality Enhanced `retryable_http_client` by allowing the injection of a custom error handler through its constructor.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	b589a882bb	s3_client: isolate `retryable_http_client` Relocated `retryable_http_client` into its own dedicated file for improved clarity and maintainability.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	5eff83af95	s3_client: Prepare for `retryable_http_client` relocation Expose `map_s3_client_exception` outside the S3 client class to facilitate moving `retryable_http_client` to a separate file.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	2b3abba10a	s3_client: Remove `is_redirect_status` function Eliminate the `is_redirect_status` function in favor of the equivalent functionality provided by Seastar's HTTP client.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	5b7d4a4136	s3_client: Move retryable functionality out of s3 client This commit moves the retryable HTTP client functionality out of the S3 client implementation. Since this functionality is also required for other services, such as AWS STS, it has been separated to ensure broader applicability.	2025-03-10 09:01:47 +02:00
Piotr Szymaniak	b6ba573dfe	HACKING.md: Provide step-by-step support to enable development with CLion Claim that building with CMake files is just 'not supported' instead of not intended, especially since there are attempts to enable this. Remove the obsolete mention of the `FOR_IDE` flag. Closes scylladb/scylladb#22890	2025-03-09 16:22:24 +02:00
Ernest Zaslavsky	6a3cef5703	metadata: Correct "DESCRIBE" output for keyspace metadata Update the "DESCRIBE" command output to accurately display `tablet` settings in keyspace metadata. Closes scylladb/scylladb#23056	2025-03-09 14:50:08 +02:00
Ernest Zaslavsky	050c3cdbc2	tests: Add Tests for Scylla-SSTable S3 Functionality Extended existing Scylla Tools tests to cover the new functionality of reading SSTables from S3. This ensures that the new S3 integration is thoroughly tested and performs as expected.	2025-03-09 10:17:48 +02:00
Ernest Zaslavsky	112b4c8764	docs: Update Scylla Tools Documentation for S3 SSTable Support Updated the Scylla Tools documentation to include changes related to the enhanced support for S3-stored SSTables. This update ensures that the documentation accurately reflects the latest functionality and improvements.	2025-03-09 09:50:37 +02:00
Ernest Zaslavsky	17e3c01f4e	scylla-sstable: Enable Support for S3 SSTables Configure the sstable manager to correctly handle storage options based on the input type (local or S3-stored sstables). This tweak allows for mixing both storage types within a single call, improving flexibility and functionality.	2025-03-09 09:50:36 +02:00
Ernest Zaslavsky	88c4fa6569	s3: Implement S3 Fully Qualified Name Manipulation Functions Added utility functions to handle S3 Fully Qualified Names (FQN). These functions enable parsing, splitting, and identification of S3 paths, enhancing our ability to work with S3 object storage more effectively.	2025-03-09 09:50:36 +02:00
Ernest Zaslavsky	38165fd285	object_storage: Refactor `object_storage.yaml` parsing logic Refactored the parsing of `object_storage.yaml` out of Scylla's `main` function. This change is made to facilitate reusability of the parsing logic in other parts of the codebase.	2025-03-09 09:50:36 +02:00
Anna Stuchlik	9ac0aa7bba	doc: zero-token nodes and Arbiter DC This commit adds documentation for zero-token nodes and an explanation of how to use them to set up an arbiter DC to prevent a quorum loss in multi-DC deployments. The commit adds two documents: - The one in Architecture describes zero-token nodes. - The other in Cluster Management explains how to use them. We need separate documents because zero-token nodes may be used for other purposes in the future. In addition, the documents are cross-linked, and the link is added to the Create a ScyllaDB Cluster - Multi Data Centers (DC) document. Refs https://github.com/scylladb/scylladb/pull/19684 Fixes https://github.com/scylladb/scylladb/issues/20294 Closes scylladb/scylladb#21348	2025-03-07 16:39:02 +01:00
Kefu Chai	2a9966a20e	gms: Fix fmt formatter for gossip_digest_syn In commit `4812a57f`, the fmt-based formatter for gossip_digest_syn had formatting code for cluster_id, partitioner, and group0_id accidentally commented out, preventing these fields from being included in the output. This commit restores the formatting by uncommenting the code, ensuring full visibility of all fields in the gossip_digest_syn message when logging permits. This fixes a regression introduced in `4812a57f`, which obscured these fields and reduced debugging insight. Backporting is recommended for improved observability. Fixes #23142 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23155	2025-03-07 15:36:03 +01:00
Robert Bindar	27f2d64725	Remove object storage config credentials provider During development of #22428 we decided that we have no need for `object-storage.yaml`, and we'd rather store the endpoints in `scylla.yaml` and get a REST api to expose the endpoints for free. This patch removes the credentials provider used to read the aws keys from this yaml file. Followup work will remove the `object-storage.yaml` file altogether and move the endpoints to `scylla.yaml`. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#22951	2025-03-07 10:40:58 +03:00
Luis Freitas	84b30d11ec	.github: trigger Jenkins job using github action This action helps prevent next-trigger from running every 15 minutes. This action will run on push for a specific branch (next, next-enterprise, 2024.x, x.x) Fixes: scylladb#23088 update action Closes scylladb/scylladb#23141	2025-03-07 06:41:58 +02:00
Vlad Zolotarov	f7e1695068	CQL Tracing: set common query parameters in a single function Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters in their payload which we always want to record in the Tracing object. Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp. Setting each of them individually can lead to a human error when one (or more) of them would not be set. Let's eliminate such a possibility by defining a single function that sets them all. This also allows an easy addition of such parameters to this function in the future.	2025-03-06 09:30:51 -05:00
Aleksandra Martyniuk	35bc1fe276	streaming: fix the way a reason of streaming failure is determined During streaming, the receiving node gets and processes mutation fragments. If this operation fails, the receiver responds with -1 status code, unless it failed due to no_such_column_family in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on the receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace the try_catch clause with table_sync_and_check that synchronizes the schema and checks whether the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834.	2025-03-06 15:07:14 +01:00
Aleksandra Martyniuk	44748d624d	streaming: save a continuation lambda In the following patches, an additional preemption point will be added to the coroutine lambda in register_stream_mutation_fragments. Assign a lambda to a variable to prolong the captures lifetime.	2025-03-06 15:07:09 +01:00
Tomasz Grabiec	c4714180cc	tablets: Make load balancing capacity-aware Before this patch the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer needs to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant setback because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to the above, so not implemented.	2025-03-06 13:35:38 +01:00
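The difference between equalizing tablet count and equalizing storage utilization can be illustrated with a small Python sketch. It is not the load balancer's actual algorithm: the function names and the greedy placement loop are assumptions made for illustration, and every tablet is given the same size, as the commit message states.

```python
def utilization(tablet_count, tablet_size, capacity):
    """Fraction of a shard's capacity used, assuming equal-sized tablets."""
    return tablet_count * tablet_size / capacity


def place_tablets(capacities, n_tablets, tablet_size):
    """Greedy placement (hypothetical): each tablet goes to the shard whose
    utilization would be lowest after receiving it."""
    counts = [0] * len(capacities)
    for _ in range(n_tablets):
        target = min(
            range(len(capacities)),
            key=lambda i: utilization(counts[i] + 1, tablet_size, capacities[i]),
        )
        counts[target] += 1
    return counts
```

For two shards with capacities 100 and 300 and eight unit-sized tablets, this sketch places 2 and 6 tablets, so both shards end at the same utilization, whereas count-based balancing would place 4 on each and leave the smaller shard three times as full.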
Tomasz Grabiec	3c0b733943	topology_coordinator: Fix confusing log message There can be other reasons the plan is empty, tablets may not actually be balanced. For example, capacity for all the nodes may not be known, or nodes may be down.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	40414c4985	topology_coordinator: Refresh load stats after adding a new node Stats are refreshed every minute by default. Load balancing cannot happen without capacity information for all normal nodes. To avoid the delay, trigger refresh after adding a new node.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	d6f8810e66	topology_coordinator: Allow capacity stats to be refreshed with some nodes down With capacity-aware balancing, if we're missing capacity for a normal node, we won't be able to proceed with tablet drain. Consider the following scenario: 1. Nodes: A, B 2. refresh stats with A and B 3. Add node C 4. Node B goes down 5. removenode B starts 6. stats refreshing fails because B is down If we don't have capacity stats for node C, load balancer cannot make decisions and removenode is blocked indefinitely. A reproducer is added in this patch. To alleviate that, we allow capacity stats to be collected for nodes which are reachable, we just don't update the table size part. To keep table stats monotonic, we cache previous results per node, so even if it's unreachable now, we use its last reported sizes. It's still more accurate than not refreshing stats at all. A node can be down for a long period, and other replicas can grow in size. It's not perfect, because the stale node can skew the stats in its direction, but ignoring it completely has its pitfalls too. Better solution is left for later.	2025-03-06 13:35:37 +01:00
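One way to read the caching scheme described above is sketched below in Python. All names here (`refresh_load_stats`, `fetch`, the `capacity` and `table_size` keys) are hypothetical and do not mirror ScyllaDB's RPC or data structures; the sketch only shows capacity being collected from reachable nodes while table sizes fall back to each node's last reported value.

```python
def refresh_load_stats(nodes, fetch, cache):
    """Collect stats from reachable nodes, reusing cached sizes for the rest.

    nodes: list of node ids
    fetch: callable(node) -> {"capacity": int, "table_size": int},
           raising ConnectionError when the node is unreachable
    cache: dict mapping node id -> last successfully fetched stats
    """
    capacity = {}
    for node in nodes:
        try:
            stats = fetch(node)
        except ConnectionError:
            continue  # node is down: keep whatever the cache already holds
        cache[node] = stats
        capacity[node] = stats["capacity"]
    # Sum the sizes over every node ever seen, so an unreachable node still
    # contributes its last reported size instead of dropping to zero.
    table_size = sum(cache[n]["table_size"] for n in nodes if n in cache)
    return capacity, table_size
```

In the scenario from the commit message (A and B refreshed, C added, B goes down), a second refresh still yields capacity for A and C and a table size that includes B's last report.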
Tomasz Grabiec	af3dce4c8a	topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places Use serialized_action for serialization and batching.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	69c49fb1a7	test: boost: tablets_test: Always provide capacity in load_stats Move shared_load_stats to topology_builder.hh so that topology_builder can maintain it. It will set capacity for all created nodes. Needed after load balancer requires capacity to make decisions.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	dfc9101dfd	test: perf_load_balancing: Set node capacity Otherwise, load balancer will not make any plan once it becomes capacity-aware.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	6169401dbc	test: perf_load_balancing: Convert to topology_builder The test no longer worked because load balancer requires proper schema in the database now. Convert to topology_builder which builds topology in the database and creates schema in the database (which needs proper topology).	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	d01cc16d1e	config, disk_space_monitor: Allow overriding capacity via config Intended for testing, or hot-fixing out-of-space issues in production. Tablet load balancer uses this information for determining per-shard load so reducing capacity will cause tablets to be migrated away from the node.	2025-03-06 13:35:37 +01:00
Avi Kivity	28906c9261	Merge 'scylla-sstable: introduce the query command' from Botond Dénes The scylla-sstable dump-* command suite has proven invaluable in many investigations. In certain cases however, I found that `dump-data` is quite cumbersome. An example would be trying to find certain values in an sstable, or trying to read the content of system tables when a node is down. For these cases, `dump-data` is very cumbersome: one has to trudge through tons of uninteresting metadata and do compaction in their heads. This PR introduces the new scylla-sstable query command, specifically targeted at situations like this: it allows executing queries on sstables, exposing to the user all the power of CQL, to tailor the output as they see fit. Select everything from a table: $ scylla sstable query --system-schema /path/to/data/system_schema/keyspaces-/-big-Data.db keyspace_name \| durable_writes \| replication -------------------------------+----------------+------------------------------------------------------------------------------------- system_replicated_keys \| true \| ({class : org.apache.cassandra.locator.EverywhereStrategy}) system_auth \| true \| ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 1}) system_schema \| true \| ({class : org.apache.cassandra.locator.LocalStrategy}) system_distributed \| true \| ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 3}) system \| true \| ({class : org.apache.cassandra.locator.LocalStrategy}) ks \| true \| ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1}) system_traces \| true \| ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 2}) system_distributed_everywhere \| true \| ({class : org.apache.cassandra.locator.EverywhereStrategy}) Select everything from a single SSTable, use the JSON output (filtered through [jq](https://jqlang.github.io/jq/) for better readability): $ scylla sstable query 
--system-schema --output-format=json /path/to/data/system_schema/keyspaces-/me-3gm7_127s_3ndxs28xt4llzxwqz6-big-Data.db \| jq [ { "keyspace_name": "system_schema", "durable_writes": true, "replication": { "class": "org.apache.cassandra.locator.LocalStrategy" } }, { "keyspace_name": "system", "durable_writes": true, "replication": { "class": "org.apache.cassandra.locator.LocalStrategy" } } ] Select a specific field in a specific partition using the command-line: $ scylla sstable query --system-schema --query "select replication from scylla_sstable.keyspaces where keyspace_name='ks'" ./scylla-workdir/data/system_schema/keyspaces-/-Data.db replication ------------------------------------------------------------------------------------- ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1}) Select a specific field in a specific partition using ``--query-file``: $ echo "SELECT replication FROM scylla_sstable.keyspaces WHERE keyspace_name='ks';" > query.cql $ scylla sstable query --system-schema --query-file=./query.cql ./scylla-workdir/data/system_schema/keyspaces-/-Data.db replication ------------------------------------------------------------------------------------- ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1}) New functionality: no backport needed. Closes scylladb/scylladb#22007 github.com:scylladb/scylladb: docs/operating-scylla: document scylla-sstable query test/cqlpy/test_tools.py: add tests for scylla-sstable query test/cqlpy/test_tools.py: make scylla_sstable() return table name also scylla-sstable: introduce the query command tools/utils: get_selected_operation(): use std::string for operation_options utils/rjson: streaming_writer: add RawValue() cql3/type_json: add to_json_type() test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()	2025-03-06 13:42:45 +02:00
Tomasz Grabiec	7e7f1e6f91	storage_service, tablets: Collect per-node capacity in load_stats New RPC is introduced because load_stats was marked "final" in the IDL. Will be needed by capacity-aware load balancing.	2025-03-06 12:17:32 +01:00
Botond Dénes	1139cf3a98	Merge 'Speed up (and generalize) the way API calculates sstable disk usage' from Pavel Emelyanov There are several API endpoints that walk a specific list of sstables and sum up their bytes_on_disk() values. All those endpoints accumulate a map of sstable names to their sizes, then squash the maps together and, finally, sum up the map values to report it back. Maintaining these intermediate collections is a waste of CPU and memory; the usage values can be summed up instantly. Also add a test for per-cf endpoints to validate the change, and generalize the helper functions while at it. Closes scylladb/scylladb#23143 * github.com:scylladb/scylladb: api: Generalize disk space counting for table and system api: Use map_reduce_cf_raw() overload with table name api: Don't collect sstables map to count disk space usage test: Add unit test for total/live sstable sizes	2025-03-06 11:26:35 +02:00
Raphael S. Carvalho	fedd838b9d	replica: Fix race of some operations like cleanup with snapshot There are two semaphores in table for synchronizing changes to sstable list: sstable_set_mutation_sem: used to serialize two concurrent operations updating the list, to prevent them from racing with each other. sstable_deletion_sem: A deletion guard, used to serialize deletion and iteration over the list, to prevent iteration from finding deleted files on disk. They're always taken in this order to avoid deadlocks: sstable_set_mutation_sem -> sstable_deletion_sem. Problem: A = tablet cleanup B = take_snapshot() 1) A acquires sstable_set_mutation_sem for updating list 2) A acquires sstable_deletion_sem, then deletes sstable before updating list 3) A releases sstable_deletion_sem, then yields 4) B acquires sstable_deletion_sem 5) B iterates through list and bumps into the sstable deleted in step 2 6) B fails since it cannot find the file on disk Initial reaction is to say that no procedure must delete sstable before updating the list, that's true. But we want an iteration, running concurrently to cleanup, to not find sstables being removed from the system. Otherwise, e.g. snapshot works with sstables of a tablet that was just cleaned up. That's achieved by serializing iteration with list update. Since sstable_deletion_sem is used within the scope of deletion only, it's useless for achieving this. Cleanup could acquire the deletion sem when preparing list updates, and then pass the "permit" to deletion function, but then sstable_deletion_sem would essentially become sstable_set_mutation_sem, which was created exactly to protect the list update. That being said, it makes sense to merge both semaphores. Also things become easier to reason about, and we don't have to worry about deadlocks anymore. The deletion goes through sstable_list_builder, which holds a permit throughout its lifetime, which guarantees that list updates and deletion are atomic to other concurrent operations. 
The interface becomes less error prone with that. It allowed us to find that discard_sstables() was doing deletion without any permit, meaning another race could happen between truncate and snapshot. So we're fixing the race of (truncate\|cleanup) with take_snapshot, as far as we know. It's possible that other unknown races are fixed as well. Fixes #23049. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23117	2025-03-06 11:00:48 +02:00
Pavel Emelyanov	86b3e9b50b	code: Move checked-file-impl.hh to util/ fixes: #22100 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23123	2025-03-06 10:22:05 +02:00
Petr Hála	f3c3eb6ae3	doc: Fix object_storage_config_file option It needs to use underscores, not dash Closes scylladb/scylladb#23161	2025-03-06 10:30:51 +03:00
Pavel Emelyanov	e7d1ea3ab6	commitlog: Use shorter input stream creation overload There's one that doesn't need the offset argument when it's 0 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23140	2025-03-06 08:06:42 +01:00
Botond Dénes	49d6bf8947	Merge 'main: safely check stop_signal in-between starting services' from Benny Halevy To simplify aborting scylla while starting the services, add a _ready state to stop_signal, so that until main is ready to be stopped by the abort_source, just register that the signal is caught, and let a check() method poll that and request abort and throw respective exception only then, in controlled points that are in-between starting of services after the service started successfully and a deferred stop action was installed. This patch prevents gate_closed_exception from escaping handling when start-up is aborted early with the stop signal, which caused https://github.com/scylladb/scylladb/issues/23153 The regression is apparently due to `a25c3eaa1c` Fixes https://github.com/scylladb/scylladb/issues/23153 * Requires backport to 2025.1 due to `a25c3eaa1c` Closes scylladb/scylladb#23103 * github.com:scylladb/scylladb: main: add checkpoints main: safely check stop_signal in-between starting services main: move prometheus start message main: move per-shard database start message	2025-03-06 08:28:29 +02:00
Vlad Zolotarov	ca6bddef35	transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing A default timestamp (not to be confused with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver. However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to set it back then). This patch fixes this. Fixes #23173	2025-03-05 20:37:37 -05:00
Takuya ASADA	781dec5852	dist/docker: run the container as non-root user Since it is a requirement for Red Hat OpenShift Certification, we need to run the container as a non-root user. Related scylladb/scylla-pkg#4858 Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2025-03-05 23:39:56 +09:00
Takuya ASADA	1abf981a73	dist/docker: switch to UBI9 Switch container base image to UBI9, to prepare for Red Hat OpenShift Certification. Fixes scylladb/scylla-pkg#4858 Signed-off-by: Takuya ASADA <syuu@scylladb.com>	2025-03-05 23:39:56 +09:00
Aleksandra Martyniuk	faf3aa13db	streaming: use streaming namespace in table_check.{cc,hh}	2025-03-05 11:00:03 +01:00
Aleksandra Martyniuk	876cf32e9d	repair: streaming: move table_check.{cc,hh} to streaming	2025-03-05 11:00:03 +01:00
Benny Halevy	8ae8275f17	main: stop system keyspace To prevent internal queries coming from system_keyspace (like updating compaction history, for example) Refs scylladb/scylla-dtest#5581 Refs #22886 Refs #8995 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:23 +02:00
Benny Halevy	7a624e3df8	system_keyspace: call shutdown from stop and use that to replace the explicit shutdown when stopped in cql_test_env. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:23 +02:00
Benny Halevy	102aec64d5	system_keyspace: shutdown: allow calling more than once Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:22 +02:00
Benny Halevy	fba88bdd62	database, compaction_manager, large_data_handler: use pluggable<system_keyspace> To allow safe plug and unplug of the system_keyspace. This patch follows up on `917fdb9e53` (more specifically - `f9b57df471`), since just keeping a shared_ptr<system_keyspace> doesn't prevent stopping the system_keyspace shards, while using the `pluggable` interface allows safe draining of outstanding async calls on shutdown, before stopping the system_keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:27:23 +02:00
Benny Halevy	13a22cb6fd	utils: add class pluggable A wrapper around a shared service allowing safe plug and unplug of the service from its user using a phased-barrier operation permit guarding the service while in use. Also add a unit test for this class. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:25:50 +02:00
Benny Halevy	b6705ad48b	main: add checkpoints Before starting significant services that didn't have a corresponding call to supervisor::notify before them. Fixes #23153 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:29:34 +02:00
Benny Halevy	feef7d3fa1	main: safely check stop_signal in-between starting services To simplify aborting scylla while starting the services, Add a _ready state to stop_signal, so that until main is ready to be stopped by the abort_source, just register that the signal is caught, and let a check() method poll that and request abort and throw respective exception only then, in controlled points that are in-between starting of services after the service started successfully and a deferred stop action was installed. Refs #23153 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:15:17 +02:00
Benny Halevy	282ff344db	main: move prometheus start message The `prometheus_server` is started only conditionally but the notification message is sent and logged unconditionally. Move it inside the conditional code block. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:09:09 +02:00
Benny Halevy	23433f593c	main: move per-shard database start message It is now logged out of place, so move it to right before calling `start` on every database shard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:09:09 +02:00
Nadav Har'El	e0f24c03e7	Merge 'test.py: merge all 'Topology' suite types into one folder 'cluster'' from Artsiom Mishuta Now that we support suite subfolders, there is no need to create a separate suite for each of object_store, auth_cluster, topology and topology_custom. This PR merges all these folders into one: 'cluster'. This PR also introduces and applies the 'prepare_3_nodes_cluster' fixture, which allows preparing a non-dirty 3-node cluster that can be reused between tests (for tests that were in the topology folder). Number of tests in master: release: 3461, dev: 3472, debug: 3446. Number of tests in this PR: release: 3460, dev: 3471, debug: 3445. There is one test fewer in each mode because there were 2 test_topology_failure_recovery files (topology and topology_custom) with the same utility functions but different test cases. This PR merged them into one. Closes scylladb/scylladb#22917 * github.com:scylladb/scylladb: test.py: merge object_store into cluster folder test.py: merge auth_cluster into cluster folder test.py: rename topology_custom folder to cluster test.py: merge topology test suite into topology_custom test.py delete conftest in topology_custom test.py apply prepare_3_nodes_cluster in topology test.py: introduce prepare_3_nodes_cluster marker	2025-03-04 19:26:32 +02:00
Pavel Emelyanov	c084de1406	api: Generalize disk space counting for table and system Now when the bodies of both map-reduce reducers are the same, they can be generalized with each other. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-04 19:56:16 +03:00
Pavel Emelyanov	4e2abba5a1	api: Use map_reduce_cf_raw() overload with table name The existing helper that counts disk space usage for a table map-reduces the table object "by hand". Its peer that counts the usage for all tables uses the map_reduce_cf_raw() helper. The latter exists for a specific table as well, so the first counter can benefit from using it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-04 19:55:05 +03:00
Pavel Emelyanov	b43e2390db	api: Don't collect sstables map to count disk space usage All the API calls that collect disk usage of sstables accumulate map<sstable name, disk size>, then merge shard maps into one, then count the "disk size" values and drop the map itself on the floor. This is a waste of CPU cycles; disk usage can be just summed up along cf/sstables iterations, no need to accumulate a map with names for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-04 19:53:42 +03:00
Pavel Emelyanov	a8fc1d64bc	test: Add unit test for total/live sstable sizes The pair of column_family/metrics/(total\|live)_disk_space_used/{name} reports the disk usage by sstables. The test creates a table, populates it, flushes it and checks that the size corresponds to what stat(2) reports for the respective files. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-04 19:52:33 +03:00
Nikos Dragazis	03902e5f17	cql3: untyped_result_set: Store rows in chunked_vector The `untyped_result_set` stores rows in std::vector. Switch to `chunked_vector` to prevent large allocations and data copies. One such case is in secondary index queries, where we convert the result of the internal index view query into an `untyped_result_set` for processing. The result is bound by the page size memory limit (1MiB by default), so it can cause large allocations of this magnitude. This patch aligns `untyped_result_set` with `result_set`, which also uses a `chunked_vector`. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-04 18:39:32 +02:00
Nikos Dragazis	892690b953	test: Reproduce bug with large allocations from secondary index Secondary index queries which fetch partitions from the base table can cause large allocations that can lead to reactor stalls. Reproduce this with a unit test that runs an indexed query on a table with thousands of single-row partitions, and checks the memory stats for any large contiguous allocations. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-04 18:39:28 +02:00
Patryk Jędrzejczak	c13b6c91d3	Merge 'raft topology: drop changing the raft voters config via storage_service' from Emil Maskovsky For the limited voters feature to work properly we need to make sure that we are only managing the voter status through the topology coordinator. This means that we should not change the node votership from the storage_service module for the raft topology directly. We can drop the voter status changes from the storage_service module because the topology coordinator will handle the votership changes eventually. The calls in the storage_service module were not essential and were only used for optimization (improving the HA under certain conditions). Furthermore, the other bundled commit improves the reaction again by reacting to the node `on_up()` and `on_down()` events, which again shortens the reaction time and improves the HA. The change has an effect on the timing in the tablets migration test though, as it previously relied on the node being made non-voter from the storage_service `raft_removenode()` function. The fix is to add another server to the topology to make sure we will keep the quorum. Previously the test worked because the test waits for an injection to be reached and it was ensured that the injection (log line) has only been triggered after the node has been made non-voter from the `raft_removenode()`. This is not the case anymore. An alternative fix would be to wait for the first node to be made non-voter before stopping the second server, but this would make the test more complex (and it is not strictly required to only use 4 servers in the test, it has been only done for optimization purposes). Fixes: scylladb/scylladb#22860 Refs: scylladb/scylladb#18793 Refs: scylladb/scylladb#21969 No backport: Part of the limited voters new feature, so this shouldn't be backported. 
Closes scylladb/scylladb#22847 * https://github.com/scylladb/scylladb: raft: use direct return of future for `run_op_with_retry` raft: adjust the voters interface to allow atomic changes raft topology: drop removing the node from raft config via storage_service raft topology: drop changing the raft voters config via storage_service	2025-03-04 13:59:47 +01:00
Nadav Har'El	d096aac200	test/cqlpy/run: reduce number of tablets In commit `2463e524ed`, Scylla's default changed from starting with one tablet per shard to starting 10 per shard. The functional tests don't need more tablets and it can only slow down the tests, so the patch added --tablets-initial-scale-factor=1 to test//suite.yaml but forgot to add it to test/cqlpy/run.py (to affect test/cqlpy/run) so this patch does this now. This patch should only be about making tests faster, although to be honest, I don't see any measurable improvement in test speed (10 isn't so many). But, unfortunately, this is only part of the story. Over time we allowed a few cqlpy tests to be written in a way that relies on having only a small number of tablets or even exactly one tablet per shard (!). These tests are buggy and should be fixed - see issues #23115 and #23116 as examples. But adding the option --tablets-initial-scale-factor=1 also to run.py will make these bugs not affect test/cqlpy/run in the same way as they don't affect test.py. These buggy tests will still break with `pytest cqlpy` against a Scylla you ran yourself manually, so we will eventually still need to fix those test bugs. Refs #23115 Refs #23116 Closes scylladb/scylladb#23125	2025-03-04 15:39:21 +03:00
Asias He	60913312af	repair: Enable small table optimization for system_replicated_keys This enterprise-only system table is replicated and small. It should be included for small table optimization. Fixes scylladb/scylla-enterprise#5256 Closes scylladb/scylladb#23135	2025-03-04 12:40:56 +02:00
Artsiom Mishuta	97a620cda9	test.py: merge object_store into cluster folder Now that we support suite subfolders, there is no need to create a separate suite for object_store	2025-03-04 10:32:44 +01:00
Artsiom Mishuta	a283b391c2	test.py: merge auth_cluster into cluster folder Now that we support suite subfolders, there is no need to create a separate suite for auth_cluster	2025-03-04 10:32:44 +01:00
Artsiom Mishuta	d1198f8318	test.py: rename topology_custom folder to cluster rename topology_custom folder to cluster as it contains not only topology test cases	2025-03-04 10:32:44 +01:00
Artsiom Mishuta	d8e17c4356	test.py: merge topology test suite into topology_custom Now that we support suite subfolders, there is no need to create a separate suite for topology	2025-03-04 10:32:44 +01:00
Artsiom Mishuta	ef62dfa6a9	test.py delete conftest in topology_custom delete conftest in a separate commit for better diff listing during the merge of topology_custom and topology	2025-03-04 10:32:43 +01:00
Artsiom Mishuta	cf48444e3b	test.py apply prepare_3_nodes_cluster in topology apply prepare_3_nodes_cluster for all tests in the topology folder via applying mark at the test module level using pytestmark https://docs.pytest.org/en/stable/example/markers.html#marking-whole-classes-or-modules set initial initial_size for topology folder to 0	2025-03-04 10:32:43 +01:00
Artsiom Mishuta	20777d7fc6	test.py: introduce prepare_3_nodes_cluster marker prepare_3_nodes_cluster marker will allow preparing non-dirty 3 nodes cluster that can be reused between tests	2025-03-04 10:32:43 +01:00
Nadav Har'El	a56751e71b	test/cqlpy: fix test assuming just one tablet The cqlpy test test_compaction.py::test_compactionstats_after_major_compaction was written to assume we have just one tablet per shard - if there are many tablets compaction splitting the data, the test scenario might not need compaction in the way that the test assumes it does. Recently (commit `2463e524ed`) Scylla's default was changed to have 10 tablets per shard - not one. This broke this test. The same commit modified test/cqlpy/suite.yaml, but that affects only test.py and not test/cqlpy/run, and also not manual runs against a manually-installed Scylla. If this test absolutely requires a keyspace with 1 and not 10 tablets, then it should create one explicitly. So this is what this test does (but only if tablets are in use; if vnodes are used that's fine too). Before this patch, test/cqlpy/run test_compaction.py::test_compactionstats_after_major_compaction fails. After the patch, it passes. Fixes #23116 Closes scylladb/scylladb#23121	2025-03-04 10:15:29 +02:00
Kefu Chai	a43072a21e	cql3,test: replace boost::range::adjacent_find with std::ranges to reduce third-party dependencies and modernize the codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22998	2025-03-04 10:08:02 +02:00
Artsiom Mishuta	d7f9c5654b	test.py: change test uname This commit changes the test uname replacement from "_" to "." to be able to support sub-folders in the scylla-pkg scripts logic Closes scylladb/scylladb#23130	2025-03-04 09:58:58 +02:00
Wojciech Mitros	dae7221342	rust: update dependencies The currently used versions of "wasmtime", "idna", "cap-std" and "cap-primitives" packages had low to moderate security issues. In this patch we update the dependencies to versions with these issues fixed. The update was performed by changing the "wasmtime" (and "wasmtime-wasi") version in rust/wasmtime_bindings/Cargo.toml and updating rust/Cargo.lock using the "cargo update" command with the affected package. To fix an issue with different dependencies having different versions of sub-dependencies, the package "smallvec" was also updated to "1.13.1". After the dependency update, the Rust code also needed to be updated because of the slightly changed API. One Wasm test case needed to be updated, as it was actually using an incorrect Wat module and not failing before. The crate also no longer allows multiple tables in Wasm modules by default - it is now enabled by setting the "gc" crate feature and configuring the Engine with config.wasm_reference_types(true). Fixes https://github.com/scylladb/scylladb/issues/23127 Closes scylladb/scylladb#23128	2025-03-04 09:45:23 +02:00
Pavel Emelyanov	e4e15a00b7	Merge 'reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Botond Dénes It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to be waiting for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up. Fixes: scylladb/scylladb#22919 Bug is present in all live versions, backports are required. Closes scylladb/scylladb#23044 * github.com:scylladb/scylladb: reader_concurrency_semaphore: register_inactive_read(): handle aborted permit test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()	2025-03-04 10:40:28 +03:00
Botond Dénes	71d8b7aa9f	querier: demote tombstone warning for range-scans to debug level Range scans are expected to go through lots of tombstones, no need to spam the logs about this. The tombstone warning log is demoted to debug level; if somebody wants to see it they can bump the logger to debug level. Fixes: https://github.com/scylladb/scylladb/issues/23093 Closes scylladb/scylladb#23094	2025-03-04 10:38:06 +03:00
Kefu Chai	a483ff8647	mutation: replace boost::upper_bound with std::ranges::upper_bound Reduces dependencies on boost/range. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23119	2025-03-04 10:36:57 +03:00
Kefu Chai	a20cd6539c	cql3, dht: Remove redundant std::move() calls These redundant `std::move()` calls were identified by GCC-14. In general, copy elision applies to these places, so adding `std::move()` is not only unnecessary but can actually prevent the compiler from performing copy elision, as it causes the return statement to fail to satisfy the requirements for copy elision optimization. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23063	2025-03-04 10:36:49 +03:00
Botond Dénes	6f7a069bce	Merge 'Label basic metrics' from Amnon Heiman This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and the metrics collecting server (Prometheus or similar). The idea in this series is to create an equivalent of levels with a label. First, label a subset of the metrics used by the dashboards. Second, the per-table metrics that are now off by default will be marked with a different label. The following specific optional features: CDC, CAS, and Alternator have a dedicated label now. This will allow users to disable all metrics of features that are not in use. All the rest of the metrics are left unlabeled. Without any changes, users would get the same metrics they are getting today. But you could pass the `__level=1` and get only those metrics the dashboard needs. That reduces the number of reported metrics by between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies). The labels are not reported based on the seastar feature of hiding labels that start with an underscore. 
Closes scylladb/scylladb#12246 * github.com:scylladb/scylladb: db/view/view.cc: label metrics with basic_level transport/server.cc: label metrics with basic_level service/storage_proxy.cc: label metrics with basic_level and cas main.cc: label metrics with basic_level streaming/stream_manager.cc: label metrics with basic_level repair/repair.cc: label metrics with basic_level service/storage_service.cc: label metrics with basic_level gms/gossiper.cc: label metrics with basic_level replica/database.cc: label metrics with basic_level cdc/log.cc: label metrics with basic_level and cdc alternator: label metrics with basic_level and alternator row_cache.cc: label metrics with basic_level query_processor.cc: label metrics with basic_level sstables.cc: label metrics with basic_level utils/logalloc.cc label metrics with basic_level commitlog.cc: label metrics with basic_level compaction_manager.cc: label metrics with basic_level Adding the __level and features labels	2025-03-04 09:32:11 +02:00
Calle Wilund	2f10205714	config: Enable optional TLS1.3 session ticket usage in cert setup Refs #22916 Adds an "enable_session_tickets" option to TLS setup for our server endpoints (not documented for internode RPC, as we don't handle it on the client side there), allowing TLS1.3 client session tickets to be enabled, i.e. quicker reconnect. Session tickets are valid within a time frame or until a node restarts, whichever comes first. v2: Use "TLS1.3" in help message Closes scylladb/scylladb#22928	2025-03-04 09:30:53 +02:00
Amnon Heiman	19a414598b	db/view/view.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_view_builder_builds_in_progress Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	9518a85ad0	transport/server.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_transport_cql_errors_total scylla_transport_current_connections scylla_transport_requests_served scylla_transport_requests_shed Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	cbae9a4abe	service/storage_proxy.cc: label metrics with basic_level and cas The following metrics will be marked with basic_level label: scylla_storage_proxy_coordinator_background_reads scylla_storage_proxy_coordinator_background_writes scylla_storage_proxy_coordinator_cas_background scylla_storage_proxy_coordinator_cas_dropped_prune scylla_storage_proxy_coordinator_cas_failed_read_round_optimization scylla_storage_proxy_coordinator_cas_foreground scylla_storage_proxy_coordinator_cas_prune scylla_storage_proxy_coordinator_cas_read_contention_bucket scylla_storage_proxy_coordinator_cas_read_contention_count scylla_storage_proxy_coordinator_cas_read_latency_count scylla_storage_proxy_coordinator_cas_read_latency_sum scylla_storage_proxy_coordinator_cas_read_timeouts scylla_storage_proxy_coordinator_cas_read_unavailable scylla_storage_proxy_coordinator_cas_read_unfinished_commit scylla_storage_proxy_coordinator_cas_total_operations scylla_storage_proxy_coordinator_cas_write_condition_not_met scylla_storage_proxy_coordinator_cas_write_contention_count scylla_storage_proxy_coordinator_cas_write_latency_count scylla_storage_proxy_coordinator_cas_write_latency_sum scylla_storage_proxy_coordinator_cas_write_timeout_due_to_uncertainty scylla_storage_proxy_coordinator_cas_write_timeouts scylla_storage_proxy_coordinator_cas_write_unavailable scylla_storage_proxy_coordinator_cas_write_unfinished_commit scylla_storage_proxy_coordinator_current_throttled_base_writes scylla_storage_proxy_coordinator_foreground_reads scylla_storage_proxy_coordinator_foreground_writes scylla_storage_proxy_coordinator_range_timeouts scylla_storage_proxy_coordinator_range_unavailable scylla_storage_proxy_coordinator_read_errors_local_node scylla_storage_proxy_coordinator_read_latency_count scylla_storage_proxy_coordinator_read_latency_sum scylla_storage_proxy_coordinator_reads_local_node scylla_storage_proxy_coordinator_reads_remote_node scylla_storage_proxy_coordinator_read_timeouts 
scylla_storage_proxy_coordinator_read_unavailable scylla_storage_proxy_coordinator_speculative_data_reads scylla_storage_proxy_coordinator_speculative_digest_reads scylla_storage_proxy_coordinator_total_write_attempts_local_node scylla_storage_proxy_coordinator_write_errors_local_node scylla_storage_proxy_coordinator_write_latency_bucket scylla_storage_proxy_coordinator_write_latency_count scylla_storage_proxy_coordinator_write_latency_sum scylla_storage_proxy_coordinator_write_timeouts scylla_storage_proxy_coordinator_write_unavailable scylla_storage_proxy_replica_received_counter_updates All cas related metrics are labeled with __cas label. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	fd5d1f1f6a	main.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_scylladb_current_version scylla_reactor_utilization Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	5747af8555	streaming/stream_manager.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_node_ops_finished_percentage Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	48397f8dff	repair/repair.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_node_ops_finished_percentage Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	83bfcb53be	service/storage_service.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_node_operation_mode Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	1b64fa2283	gms/gossiper.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_gossip_heart_beat scylla_gossip_live scylla_gossip_unreachable Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	cfc5c60ba5	replica/database.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_database_active_reads scylla_database_dropped_view_updates scylla_database_queued_reads scylla_database_requests_blocked_memory scylla_database_requests_blocked_memory_current scylla_database_schema_changed scylla_database_total_reads scylla_database_total_reads_failed scylla_database_total_view_updates_pushed_local scylla_database_total_view_updates_pushed_remote scylla_database_total_writes scylla_database_total_writes_failed scylla_database_total_writes_timedout scylla_database_total_writes_rate_limited scylla_database_view_update_backlog Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	cf50c71ef5	cdc/log.cc: label metrics with basic_level and cdc The following metrics will be marked with basic_level label: scylla_cdc_operations_failed scylla_cdc_operations_total All metrics are labeled with the __cdc label. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:38 +02:00
Amnon Heiman	a474e95ef0	alternator: label metrics with basic_level and alternator The following metrics will be marked with basic_level label: scylla_alternator_operation scylla_alternator_op_latency_bucket scylla_alternator_op_latency_count scylla_alternator_op_latency_sum scylla_alternator_total_operations scylla_alternator_batch_item_count scylla_alternator_op_latency scylla_alternator_op_latency_summary scylla_expiration_items_deleted All alternator metrics are marked with __alternator label. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:38 +02:00
Amnon Heiman	f40dc4e5c4	row_cache.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_cache_bytes_total scylla_cache_bytes_used scylla_cache_partition_evictions scylla_cache_partition_hits scylla_cache_partition_insertions scylla_cache_partition_merges scylla_cache_partition_misses scylla_cache_partition_removals scylla_cache_range_tombstone_reads scylla_cache_reads scylla_cache_reads_with_misses scylla_cache_row_evictions scylla_cache_row_hits scylla_cache_row_insertions scylla_cache_row_misses scylla_cache_row_removals scylla_cache_rows scylla_cache_rows_merged_from_memtable scylla_cache_row_tombstone_reads Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:38 +02:00
Amnon Heiman	0dde54d053	query_processor.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_cql_authorized_prepared_statements_cache_evictions scylla_cql_batches scylla_cql_deletes scylla_cql_deletes_per_ks scylla_cql_filtered_read_requests scylla_cql_filtered_rows_dropped_total scylla_cql_filtered_rows_matched_total scylla_cql_filtered_rows_read_total scylla_cql_inserts scylla_cql_inserts_per_ks scylla_cql_prepared_cache_evictions scylla_cql_reads scylla_cql_reads_per_ks scylla_cql_reverse_queries scylla_cql_rows_read scylla_cql_secondary_index_reads scylla_cql_select_bypass_caches scylla_cql_select_partition_range_scan_no_bypass_cache scylla_cql_statements_in_batches scylla_cql_unpaged_select_queries scylla_cql_unpaged_select_queries_per_ks scylla_cql_updates scylla_cql_updates_per_ks	2025-03-03 16:58:38 +02:00
Amnon Heiman	94ba8af788	sstables.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_sstables_cell_tombstone_writes scylla_sstables_range_tombstone_reads scylla_sstables_range_tombstone_writes scylla_sstables_row_tombstone_reads scylla_sstables_tombstone_writes	2025-03-03 16:58:38 +02:00
Amnon Heiman	bf39a760aa	utils/logalloc.cc label metrics with basic_level The following metrics will be marked with basic_level label: scylla_lsa_total_space_bytes scylla_lsa_non_lsa_used_space_bytes Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:38 +02:00
Amnon Heiman	6826b98c88	commitlog.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_commitlog_segments scylla_commitlog_allocating_segments scylla_commitlog_unused_segments scylla_commitlog_alloc scylla_commitlog_flush scylla_commitlog_bytes_written scylla_commitlog_pending_allocations scylla_commitlog_requests_blocked_memory scylla_commitlog_flush_limit_exceeded scylla_commitlog_disk_total_bytes scylla_commitlog_disk_active_bytes scylla_commitlog_disk_slack_end_bytes	2025-03-03 16:58:38 +02:00
Amnon Heiman	67ca02b361	compaction_manager.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_compaction_manager_compactions	2025-03-03 16:58:38 +02:00
Amnon Heiman	30b34d29b2	Adding the __level and features labels Scylla generates many metrics, and when multiplied by the number of shards, the total number of metrics adds a significant load to a monitoring server. With multi-tier monitoring, it is helpful to have a smaller subset of metrics users care about and allow them to get only those. This patch adds two kinds of labels. The first is a __level label, currently with a single value, but we can add more in the future. The second kind is a cross-feature label, currently for alternator, cdc and cas. We will use the __level label to mark the interesting user-facing metrics. The current level value is: basic - metrics for Scylla monitoring In this phase, basic will mark all metrics used in the dashboards. In practice, without any configuration change, Prometheus would get the same metrics as it gets today. It is also possible to filter by the label, e.g.: curl http://localhost:9180/metrics?__level=basic The labels themselves are not reported thanks to label filtering of labels that begin with __. The feature labels: __cdc, __cas and __alternator can be an easy way to disable a set of metrics when not using a feature. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:38 +02:00
Emil Maskovsky	8c67307971	raft: use direct return of future for `run_op_with_retry` Clean up the code by using direct return of future for `run_op_with_retry`. This can be done as the `run_op_with_retry` function is already returning a future that we can reuse directly. What needs to be taken care of is to not use temporaries referenced from inside the lambda passed to the `run_op_with_retry`.	2025-03-03 15:19:58 +01:00
Emil Maskovsky	28d1aeb1fa	raft: adjust the voters interface to allow atomic changes Allow setting the voters and non-voters in a single operation. This ensures that the configuration changes are done atomically. In particular, we don't want to set voters and non-voters separately because it could lead to inconsistencies or even the loss of quorum. This change also partially reverts the commit `115005d`, as we will only need the convenience wrappers for removing the voters (not for adding them). Refs: scylladb/scylladb#18793	2025-03-03 15:19:58 +01:00
Emil Maskovsky	074f4fcdf1	raft topology: drop removing the node from raft config via storage_service For the limited voters feature to work properly we need to make sure that we are only managing the voter status through the topology coordinator. This means that we should not change the node votership from the storage_service module for the raft topology directly. This needs to be done in addition to dropping of the votership change from the storage_service module. The `remove_from_raft_config` is redundant and can be removed because a successfully completed `removenode` operation implies that the node has been removed from group 0 by the topology coordinator. Refs: scylladb/scylladb#22860 Refs: scylladb/scylladb#18793 Refs: scylladb/scylladb#21969	2025-03-03 15:15:43 +01:00
Emil Maskovsky	834f506790	raft topology: drop changing the raft voters config via storage_service For the limited voters feature to work properly we need to make sure that we are only managing the voter status through the topology coordinator. This means that we should not change the node votership from the storage_service module for the raft topology directly. We can drop the voter status changes from the storage_service module because the topology coordinator will handle the votership changes eventually. The calls in the storage_service module were not essential and were only used for optimization (improving the HA under certain conditions). This has an effect on the timing in the tablets migration test though, as it relied on the node being made non-voter from the storage_service `raft_removenode()` function. The fix is to add another server to the topology to make sure we will keep the quorum. Previously the test worked because the test waits for an injection to be reached and it was ensured that the injection (log line) has only been triggered after the node has been made non-voter from the `raft_removenode()`. This is not the case anymore. An alternative fix would be to wait for the first node to be made non-voter before stopping the second server, but this would make the test more complex (and it is not strictly required to only use 4 servers in the test, it has been only done for optimization purposes). Fixes: scylladb/scylladb#22860 Refs: scylladb/scylladb#18793 Refs: scylladb/scylladb#21969	2025-03-03 15:15:43 +01:00
Artsiom Mishuta	90106c6f19	test.py: skip test_incremental_read_repair[row-tombstone] skip test test_incremental_read_repair[row-tombstone] due to https://github.com/scylladb/scylladb/issues/21179 Closes scylladb/scylladb#23126	2025-03-03 15:26:28 +02:00
Nadav Har'El	ea19b79fe2	Merge 'De-duplicate API's table name to table ID conversion' from Pavel Emelyanov This is continuation of #21533 There are two almost identical helpers in api/ -- validate_table(ks, cf) and get_uuid(ks, cf). Both check if the ks:cf table exists, throwing bad_param_exception if it doesn't. There's slight difference in their usage, namely -- callers of the latter one get the table_id found and make use of it, while the former helper is void and its callers need to re-search for the uuid again if the need (spoiler: they do). This PR merges two helpers together, so there's less code to maintain. As a nice side effect, the existing validate_table() callers save one re-lookup of the ks:cf pair in database mappings. Affected endpoints are validated by existing tests: * column_family/{autocompation\|tombstone_gc\|compaction_strategy}, validated by the tests described in #21533 * /storage_service/{range_to_endpoint_map\|describe_ring\|ownership}, validated by nodetool tests * /storage_service/tablets/{move\|repair}, validated by tablets move and repair tests Closes scylladb/scylladb#22742 * github.com:scylladb/scylladb: api: Remove get_uuid() local helper api: Make use of validate_table()'s table_id api: Make validate_table() helper return table_id after validation api: Change validate_table()'s ctx argument to database	2025-03-03 13:39:50 +02:00
Kefu Chai	5571b537b5	tree: Make values mutable to enable move semantics Previously, variables were marked as const, causing std::move() calls to be redundant as reported by GCC warnings. This change either removes const qualifiers or marks related lambdas as mutable, allowing the compiler to properly utilize move constructors for better performance. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23066	2025-03-03 13:53:02 +03:00
Evgeniy Naydanov	cb0e0ebcf7	test.py: extract prepare dirs and S3 mock steps to test/conftest.py As a part of the moving to bare pytest we need to extract the required test environment preparation steps into pytest's hooks/fixtures. Do this for S3 mock stuff (MinioServer, MockS3Server, and S3ProxyServer) and for directories with test artifacts. For compatibility reason add --test-py-init CLI option for bare pytest test runner: need to add it to pytest command if you need test.py stuff in your tests (boost, topology, etc.) Also, postpone initialization of TestSuite.artifacts and TestSuite.hosts from import-time to runtime. Closes scylladb/scylladb#23087	2025-03-03 13:24:37 +03:00
Kefu Chai	a3ac7c3d33	remove redundant std::move() from position_in_partition::key() Fix GCC warning about moving from a const reference in mp_row_consumer_k_l::flush_if_needed. Since position_in_partition::key() returns a const reference, std::move has no effect. Considered adding an rvalue reference overload (clustering_key_prefix&& key() &&) but since the "me" sstable format is mandatory since `63b266e9`, this approach offers no benefit. This change simply removes the redundant std::move() call to silence the warning and improve code clarity. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23085	2025-03-03 12:51:40 +03:00
Piotr Dulikowski	d402a19e9a	Merge 'replica: Prepare full keyspace config in make_keyspace_config()' from Pavel Emelyanov Currently for system keyspace part of config members are configured outside of this helper, in the caller. It's more consistent to have full config initialization in one place. Closes scylladb/scylladb#22975 * github.com:scylladb/scylladb: replica: Mark database::make_keyspace_config() private replica: Prepare full keyspace config in make_keyspace_config()	2025-03-03 10:44:42 +01:00
Kefu Chai	65bc6b449e	scripts/open-coredump.sh: use the remote repo containing given sha1 Enhance how the script handles remote repository selection for a given SHA1 commit hash. Previously, in `3bdbe620`, the script fetched from all remotes containing the product name, which could lead to inefficiencies and errors, especially with multiple matching remotes. Now, it first checks if the SHA1 is in any local remote-tracking branch, using that remote if found, and otherwise fetches from each remote sequentially to find the first one containing the SHA1. This approach minimizes unnecessary fetches, making the script more efficient for debugging coredumps in repositories with multiple remotes. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23026	2025-03-03 08:22:41 +02:00
Paweł Zakrzewski	9e7f79d1ab	cql3/select_statement: require LIMIT and PER PARTITION LIMIT to be strictly positive LIMIT and PER PARTITION LIMIT limit the number of rows returned or taken into consideration by a query. It makes no logical sense to have this value at less than 1. Cassandra also has this requirement. This patch ensures that the limit value is strictly positive and adds an explicit test for it - it was only tested in a test ported from Cassandra, that is disabled due to other issues. Closes scylladb/scylladb#23013	2025-03-03 08:13:27 +02:00
Tomasz Grabiec	0343235aa2	Merge 'tablets: repair: fix hosts and dcs filters behavior for tablet repair' from Aleksandra Martyniuk If hosts and/or dcs filters are specified for tablet repair and some replicas match these filters, choose the replica that will be the repair master according to round-robin principle (currently it's always the first replica). If hosts and/or dcs filters are specified for tablet repair and no replica matches these filters, the repair succeeds and the repair request is removed (currently an exception is thrown and tablet repair scheduler reschedules the repair forever). Fixes: https://github.com/scylladb/scylladb/issues/23100. Needs backport to 2025.1 that introduces hosts and dcs filters for tablet repair Closes scylladb/scylladb#23101 * github.com:scylladb/scylladb: test: add new cases to tablet_repair tests test: extract repiar check to function locator: add round-robin selection of filtered replicas locator: add tablet_task_info::selected_by_filters service: finish repair successfully if no matching replica found	2025-03-01 14:47:43 +01:00
Jenkins Promoter	7b50fbafb3	Update pgo profiles - aarch64	2025-03-01 04:58:49 +02:00
Jenkins Promoter	84e1514152	Update pgo profiles - x86_64	2025-03-01 04:26:11 +02:00
Anna Stuchlik	850aec58e0	doc: add the 2025.1 upgrade guides and reorganize the upgrade section This commit adds the upgrade guides relevant in version 2025.1: - From 6.2 to 2025.1 - From 2024.x to 2025.1 It also removes the upgrade guides that are not relevant in 2025.1 source available: - Open Source upgrade guides - From Open Source to Enterprise upgrade guides - Links to the Enterprise upgrade guides Also, as part of this PR, the remaining relevant content has been moved to the new About Upgrade page. WHAT NEEDS TO BE REVIEWED - Review the instructions in the 6.2-to-2025.1 guide - Review the instructions in the 2024.x-to-2025.1 guide - Verify that there are no references to Open Source and Enterprise. The scope of this PR does not have to include metrics - the info can be added in a follow-up PR. Fixes https://github.com/scylladb/scylladb/issues/22208 Fixes https://github.com/scylladb/scylladb/issues/22209 Fixes https://github.com/scylladb/scylladb/issues/23072 Fixes https://github.com/scylladb/scylladb/issues/22346 Closes scylladb/scylladb#22352	2025-02-28 15:18:34 +03:00
Aleksandra Martyniuk	c7c6d820d7	test: add new cases to tablet_repair tests Add tests for tablet repair with host and dc filters that select one or no replica.	2025-02-28 13:03:04 +01:00
Aleksandra Martyniuk	c40eaa0577	test: extract repair check to function	2025-02-28 13:01:10 +01:00
Aleksandra Martyniuk	2b538d228c	locator: add round-robin selection of filtered replicas	2025-02-28 12:32:55 +01:00
Aleksandra Martyniuk	fe4e99d7b3	locator: add tablet_task_info::selected_by_filters Extract dcs and hosts filters check to a method.	2025-02-28 12:02:21 +01:00
Kefu Chai	af6895548c	cql3: result_set: Initialize result_generator::_stats to prevent undefined behavior Previously, when result_generator's default constructor was called, the _stats member variable remained uninitialized. This could lead to undefined behavior in release builds where uninitialized values are unpredictable, making issues difficult to debug. This change initializes the pointer to nullptr, ensuring consistent behavior across all build types and preventing potential memory-related bugs. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23073	2025-02-28 13:57:13 +03:00
Kefu Chai	9e0e99347f	sstables: explicitly call parent's default constructor in copy constructor When implementing the copy constructor for `sstable_set` (derived from `enable_lw_shared_from_this`), we intentionally need the parent's default constructor rather than its copy constructor. This is because each new `sstable_set` instance maintains its own reference count and owns a clone of the source object's implementation (`x._impl->clone()`). Although this behavior is correct, GCC warns about not calling the parent's copy constructor. This change explicitly calls the parent's default constructor to: 1. Silence GCC warnings 2. Clearly document our intention to use the default constructor 3. Follow best practices for constructor initialization The functionality remains unchanged, but the code is now more explicit about its design and free of compiler warnings. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23083	2025-02-28 13:52:24 +03:00
Aleksandra Martyniuk	9bce40d917	service: finish repair successfully if no matching replica found If hosts and/or dcs filters are specified for tablet repair and no replica matches these filters, an exception is thrown. The repair fails and tablet repair scheduler reschedules it forever. Such a repair should actually succeed (as all specified replicas were repaired) and the repair request should be removed. Treat the repair as successful if the filters were specified and selected no replica.	2025-02-28 11:50:52 +01:00
Botond Dénes	7ba29ec46c	reader_concurrency_semaphore: register_inactive_read(): handle aborted permit It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up. Fixes: scylladb/scylladb#22919	2025-02-28 01:32:46 -05:00
Botond Dénes	4d8eb02b8d	test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now() Unless the test in question actually wants to test timeouts. Timeouts will have more pronounced consequences soon and thus using db::timeout_clock::now() becomes a sure way to make tests flaky. To avoid this, use db::no_timeout in the tests that don't care about timeouts.	2025-02-28 01:31:33 -05:00
Anna Stuchlik	439463dbbf	doc: add support for Ubuntu 24.04 in 2024.1 Fixes https://github.com/scylladb/scylladb/issues/22841 Refs https://github.com/scylladb/scylla-enterprise/issues/4550 Closes scylladb/scylladb#22843	2025-02-27 15:12:31 +03:00
Anna Stuchlik	0999fad279	doc: add information about tablets limitation to the CQL page This commit adds a link to the Limitations section on the Tablets page to the CQL page, the tablets option. This is actually the place where the user will need the information: when creating a keyspace. In addition, I've reorganized the section for better readability (otherwise, the section about limitations was easy to miss) and moved the section up on the page. Note that I've removed the updated content from the `_common` folder (which I deleted) to the .rst page - we no longer split OSS and Enterprise, so there's no need to keep using the `scylladb_include_flag` directive to include OSS- and Ent-specific content. Fixes https://github.com/scylladb/scylladb/issues/22892 Fixes https://github.com/scylladb/scylladb/issues/22940 Closes scylladb/scylladb#22939	2025-02-27 15:11:19 +03:00
Asias He	3f59a89e85	repair: Fix return type for storage_service/tablets/repair API The API returns the repair task UUID. For example: {"tablet_task_id":"3597e990-dc4f-11ef-b961-95d5ead302a7"} Fixes #23032 Closes scylladb/scylladb#23050	2025-02-27 12:38:12 +02:00
Kefu Chai	834450f604	github: Skip clang-tidy when not explicitly requested Previously, the clang-tidy.yaml workflow would cancel the clang-tidy job when a comment wasn't prefixed with "/clang-tidy", instead of skipping it. This cancellation triggered unnecessary email notifications for developers with GitHub action notifications enabled. This change modifies the workflow to only run clang-tidy when the read-toolchain job succeeds, reducing notification noise by properly skipping the job rather than cancelling it. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23084	2025-02-27 13:28:35 +03:00
Artsiom Mishuta	cd5d34f9b7	test.py: fix failed_test collection After introducing the test.py subfolders support, test.py started creating weird log files like testlog/topology_custom.mv/tablets/test_mv_tablets.1, which affects the failed test collection logic. This commit fixes this, and test.py logs as before in the testlog directory without any subfolders: topology_custom.mv_tablets_test_mv_tablets.1 Closes scylladb/scylladb#23009	2025-02-27 12:37:11 +03:00
Avi Kivity	3f05fa3a9b	test: lib: replace boost::generate with std equivalent Reduces dependencies on boost/range. Closes scylladb/scylladb#23034	2025-02-27 01:05:46 +01:00
Kefu Chai	c45f9b7155	utils/sorting: Fix VerticesContainer concept constraints Fix a bug where std::same_as<...> constraint was incorrectly used as a simple requirement instead of a nested requirement or part of a conjunction. This caused the constraint to be always satisfied regardless of the actual types involved. This change promotes std::same_as<...> to a top-level constraint, ensuring proper type checking while improving code readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23068	2025-02-26 23:23:53 +02:00
Kefu Chai	6e4cb20a69	tree: implement boost::accumulate with std::ranges library Replace boost::accumulate() calls with std::ranges facilities. This change reduces external dependencies and modernizes the codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23062	2025-02-26 23:22:02 +02:00
Kefu Chai	41dd004c20	conf: scylla.yaml: correct a misspelling s/ommitted/omitted/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23055	2025-02-26 23:19:56 +02:00
Pavel Emelyanov	27e96be6ad	B+tree: Clean const_iterator->iterator conversion The tree code has const and non-const overloads for searching methods like find(), lower_bound(), etc. Not to implement them twice, it's coded like const_iterator find() const { ... // the implementation itself } iterator find() { return iterator(const_cast<const *>(this)->find()); } i.e. -- the const overload is called, and the const_iterator it returns is converted into a non-const iterator. For that the latter has a dedicated constructor with two inaccuracies: it's not marked as explicit and it accepts a const rvalue reference. This patch fixes both. Although this disables implicit const -> non-const conversion of iterators, the constructor in question is public, which still opens a way for conversion (without const_cast<>). This constructor would better be marked private, but there's the double_decker class that uses bptree and exploits the same hacks in its finding methods, so it needs this constructor to be callable. Alas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23069	2025-02-26 23:17:27 +02:00
Kefu Chai	da9960db1c	tree: Fix polymorphic exception handling by using references Replace value-based exception catching with reference-based catching to address GCC warnings about polymorphic type slicing: ``` warning: catching polymorphic type ‘class seastar::rpc::stream_closed’ by value [-Wcatch-value=] ``` When catching polymorphic exceptions by value, the C++ runtime copies the thrown exception into a new instance of the specified type, slicing the actual exception and potentially losing important information. This change ensures all polymorphic exceptions are caught by reference to preserve the complete exception state. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23064	2025-02-26 23:15:16 +02:00
Piotr Szymaniak	f887466c3f	alternator: Clean error handling on CreateTable without AttributeDefinitions If user fails to supply the AttributeDefinitions parameter when creating a table, Scylla used to fail on RAPIDJSON_ASSERT. Now it calls a polite exception, which is fully in-line with what DynamoDB does. The commit supplies also a new, relevant test routine. Fixes #23043 Closes scylladb/scylladb#23041	2025-02-26 14:24:57 +02:00
Botond Dénes	5d63ef4d15	Merge 'scylla sstable: Add standard extensions and propagate to schema load ' from Calle Wilund Fixes #22314 Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them. Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points. Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line. Closes scylladb/scylladb#22327 * github.com:scylladb/scylladb: tools: Add standard extensions and propagate to schema load cql_test_env: Use add all extensions instead of inidividually main: Move extensions adding to function tomstone_gc: Make validate work for tools	2025-02-26 13:52:47 +02:00
Kefu Chai	6e4df57f97	mutation,test: replace boost::equal with std::ranges::equal to reduce third-party dependencies and modernize the codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22999	2025-02-26 14:27:42 +03:00
Andrzej Jackowski	b4f0a5149a	db: cql3: add comments regarding unsafe interval<clustering_key_prefix> class clustering_range is a range of Clustering Key Prefixes implemented as interval<clustering_key_prefix>. However, due to the nature of Clustering Key Prefix, the ordering of clustering_range is complex and does not satisfy the invariant of interval<>. To be more specific, as a comment in interval<> implementation states: “The end bound can never be smaller than the start bound”. As a range of CKP violates the invariant, some algorithms, like intersection(), can return incorrect results. For more details refer to scylladb#8157, scylladb#21604, scylladb#22817. This commit: - Add a WARNING comment to discourage usage of clustering_range - Add WARNING comments to potentially incorrect uses of interval<clustering_key_prefix> non-trivial methods - Add a FIXME comment to incorrect use of interval<clustering_key_prefix_view>::deoverlap and WARNING comments to related interval<clustering_key_prefix_view> misuse. Closes scylladb/scylladb#22913	2025-02-26 12:01:28 +01:00
Wojciech Mitros	6bc445b841	test: increase timeout for adding a server in test_mv_topology_change Currently, when we add servers to the cluster in the test, we use a 60s timeout which proved to be not enough in one of the debug runs. There is no reason for this test to use a shorter timeout than all the other tests, so in this patch we reset it to the higher default. Fixes https://github.com/scylladb/scylladb/issues/23047 Closes scylladb/scylladb#23048	2025-02-26 10:18:05 +02:00
Pavel Emelyanov	db1e29cfea	replica: Mark database::make_keyspace_config() private It's not been used outside of database class for long ago already Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-26 09:56:07 +03:00
Pavel Emelyanov	d7018ae3d9	replica: Prepare full keyspace config in make_keyspace_config() Currently for system keyspace part of config members are configured outside of this helper, in the caller. It's more consistent to have full config initialization in one place. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-26 09:54:12 +03:00
Pavel Emelyanov	eff61b167c	treewide: Reduce db/config.hh header fanout Drop it from files that obviously don't need it. Also kill some forward declarations while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#22979	2025-02-25 15:16:40 +01:00
Piotr Dulikowski	43ae3ab703	test: test_mv_topology_change: increase timeout for removenode The test `test_mv_topology_change` is a regression test for scylladb/scylladb#19529. The problem was that CL=ANY writes issued when all replicas were down would be kept in memory until the timeout. In particular, MV updates are CL=ANY writes and have a 5 minute timeout. When doing topology operations for vnodes or when migrating tablet replicas, the cluster goes through stages where the replica sets for writes undergo changes, and the writes started with the old replica set need to be drained first. Because of the aforementioned MV updates, the removenode operation could be delayed by 5 minutes or more. Therefore, the `test_mv_topology_change` test uses a short timeout for the removenode operation, i.e. 30s. Apparently, this is too low for the debug mode and the test has been observed to time out even though the removenode operation is progressing fine. Increase the timeout to 60s. This is the lowest timeout for the removenode operation that we currently use among the in-repo tests, and is lower than 5 minutes so the test will still serve its purpose. Fixes: scylladb/scylladb#22953 Closes scylladb/scylladb#22958	2025-02-25 17:00:36 +03:00
Evgeniy Naydanov	e572771f36	test.py: refactor test.py: move test suites classes into pylib Split huge test.py into smaller pieces: test.pylib.suite.* Closes scylladb/scylladb#23005	2025-02-25 14:35:29 +03:00
Avi Kivity	6e70e69246	test/lib: mutation_assertions: deinline While generally better to reduce inline code, here we get rid of the clustering_interval_set.hh dependency, which in turns depends on boost interval_set, a large dependency. incremental_compaction_test.cc is adjusted for a missing header. Closes scylladb/scylladb#22957	2025-02-25 11:40:54 +01:00
Calle Wilund	e49f2046e5	generic_server: Update conditions for is_broken_pipe_or_connection_reset Refs scylla-enterprise#5185 Fixes #22901 If a tls socket gets EPIPE the error is not translated to a specific gnutls error code, but only a generic ERROR_PULL/PUSH. Since we treat EPIPE as ignorable for plain sockets, we need to unwind nested exception here to detect that the error was in fact due to this, so we can suppress log output for this. Closes scylladb/scylladb#22888	2025-02-25 10:35:11 +02:00
Kefu Chai	9fdbe0e74b	tree: Remove unused boost headers This commit eliminates unused boost header includes from the tree. Removing these unnecessary includes reduces dependencies on the external Boost.Adapters library, leading to faster compile times and a slightly cleaner codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22997	2025-02-25 10:32:32 +03:00
Kefu Chai	42335baec5	backup_task: Use INFO level for upload abort during shutdown When a backup upload is aborted due to instance shutdown, change the log level from ERROR to INFO since this is expected behavior. Previously, `abort_requested_exception` during upload would trigger an ERROR log, causing test failures since error logs indicate unexpected issues. This change: - Catches `abort_requested_exception` specifically during file uploads - Logs these shutdown-triggered aborts at INFO level instead of ERROR - Aligns with how `abort_requested_exception` is handled elsewhere in the service This prevents false test failures while still informing administrators about aborted uploads during shutdown. Fixes scylladb/scylladb#22391 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22995	2025-02-25 10:32:10 +03:00
Benny Halevy	55dbf5493c	docs: document the views-with-tablets experimental feature Refs scylladb/scylladb#22217 Fixes scylladb/scylladb#22893 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22896	2025-02-24 17:23:08 +01:00
Avi Kivity	d99df7af6c	Merge 'Respect per-shard tablet goal and 10x default per-shard tablet count' from Tomasz Grabiec This series achieves two things: 1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards This will result in new tables having at least 10 tablet replicas per shard by default. We want this to reduce tablet load imbalance due to differences in tablet count per shard, where some shards have 1 tablet and some shards have 2 tablets. With higher tablet count per shard, this difference-by-one is less relevant. Fixes https://github.com/scylladb/scylladb/issues/21967 2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count The per-shard goal is enforced by controlling average per-shard tablet replica count in a given DC, which is controlled by per-table tablet count. This is effective in respecting the limit on individual shards as long as tablet replicas are distributed evenly between shards. There is no attempt to move tablets around in order to enforce limits on individual shards in case of imbalance between shards. If the average per-shard tablet count exceeds the limit, all tables which contribute to it (have replicas in the DC) are scaled down by the same factor. Due to rounding up to the nearest power of 2, we may overshoot the per-shard goal by at most a factor of 2. The scaling is applied after computing desired tablet count due to all other factors: per-table tablet count hints, defaults, average tablet size. If different DCs want different scale factors of a given table, the lowest scale factor is chosen for a given table. When creating a new table, its tablet count is determined by tablet scheduler using the scheduler logic, as if the table was already created. So any scaling due to per-shard tablet count goal is reflected immediately when creating a table. 
It may however still take some time for the system to shrink existing tables. We don't reject requests to create new tables. Fixes #21458 Closes scylladb/scylladb#22522 * github.com:scylladb/scylladb: config, tablets: Allow tablets_initial_scale_factor to be a fraction test: tablets_test: Test scaling when creating lots of tables test: tablets_test: Test tablet count changes on per-table option and config changes test: tablets_test: Add support for auto-split mode test: cql_test_env: Expose db config config: Make tablets_initial_scale_factor live-updateable tablets: load_balancer: Pick initial_scale_factor from config tablets, load_balancer: Fix and improve logging of resize decisions tablets, load_balancer: Log reason for target tablet count tablets: load_balancer: Move hints processing to tablet scheduler tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table tablets: load_balancer: Determine desired count from size separately from count from options tablets: load_balancer: Determine resize decision from target tablet count tablets: load_balancer: Allow splits even if table stats not available tablets: load_balancer: Extract make_sizing_plan() tablets: Add formatter for resize_decision::way_type tablets: load_balancer: Simplify resize_urgency_cmp() tablets: load_balancer: Keep config items as instance members locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology() tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard tablets: Set default initial tablet count scale to 10 tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology() tablets: load_balancer: Extract get_schema_and_rs() tablets: load_balancer: Drop test_mode	2025-02-24 17:59:26 +02:00
Łukasz Paszkowski	9ec1a457d6	alter_keyspace_statement: Include tablets information in system.topology Altering a keyspace (that has tablets enabled) without changing tablets attributes, i.e. no `AND tablets = {...}` results in incorrect "Update Keyspace..." log message being printed. The printed log contains "tablets={"enabled":false}". Refs https://github.com/scylladb/scylladb/issues/22261 Closes scylladb/scylladb#22324	2025-02-24 15:11:14 +02:00
Botond Dénes	6ae3076b4e	Merge 'tablet-mon.py: Improve split&merge visualization and make tablet id text optional in table mode' from Tomasz Grabiec Tablet sequence number was part of the tablet identifier together with last token, so on split and merge all ids changed and it appeared in the simulator as all tablets of a table dropping and being created anew. That's confusing. After this change, only last token is part of the id, so split appears as adding tablets and merge appears as removing half the tablets, which is more accurate. Also includes an enhancement to make showing of tablet id text optional in table mode. Closes scylladb/scylladb#22981 * github.com:scylladb/scylladb: tablet-mon.py: Don't show merges and splits as full table recreations tablet-mon.py: Add toggle for tablet ids	2025-02-24 15:09:54 +02:00
Takuya ASADA	f2a8ae101b	dist/docker: drop hostname package, use Python API We currently depend on the hostname command to get the local IP, but we can do this with the Python API. After the change, we can drop the package. Closes scylladb/scylladb#22909	2025-02-24 15:03:44 +02:00
Anna Stuchlik	d0a48c5661	doc: remove the reference to the 6.2 version This commit removes the OSS version name, which is irrelevant and confusing for 2025.1 and later users. Also, it updates the warning to avoid specifying the release when the deprecated feature will be removed. Fixes https://github.com/scylladb/scylladb/issues/22839 Closes scylladb/scylladb#22936	2025-02-24 15:02:11 +02:00
Botond Dénes	6ab16006a2	Merge 'Untangle sstable-directory vs sstable in pending log creation code' from Pavel Emelyanov There's a sstable_directory::create_pending_deletion_log() helper method that's called by sstable's filesystem_storage atomic-delete methods and that prepares the deletion log for a bunch of sstables. For that method to do its job it needs to get private sstable->_storage field (which is always the filesystem_storage one), also the atomic-delete transparent context object is leaked into the sstable_directory code and low-level sstable storage code needs to include higher-level sstable_directory header. This patch unties these knots. As the result: - friendship between sstable and sstable_directory is removed - transparent atomic_delete_context is encapsulated in storage.(cc|hh) code - less code for create_pending_deletion_log() to dump TOC filename into log Closes scylladb/scylladb#22823 * github.com:scylladb/scylladb: sstable: Unfriend sstable_directory class sstable_directory: Move sstable_directory::pending_delete_result sstable_directory: Calculate prefixes outside of create_pending_deletion_log() sstable_directory: Introduce local pending_delete_log variable sstable_directory: Relax toc file dumping to deletion log	2025-02-24 14:58:37 +02:00
Paweł Zakrzewski	854d2917a1	cql3/select_statement: reject PER PARTITION LIMIT with SELECT DISTINCT Before this patch we silently allowed and ignored PER PARTITION LIMIT. SELECT DISTINCT requires all the partition key columns, which means that setting PER PARTITION LIMIT is redundant - only one result will be returned from every partition anyway. Cassandra behaves the same way, so this patch also ensures compatibility. Fixes scylladb/scylladb#15109 Closes scylladb/scylladb#22950	2025-02-24 14:50:18 +02:00
Yaron Kaikov	e6227f9a25	install-dependencies.sh: update node_exporter to 1.9.0 Update node_exporter to 1.9.0 to resolve the following CVE's https://github.com/advisories/GHSA-49gw-vxvf-fc2g https://github.com/advisories/GHSA-8xfx-rj4p-23jm https://github.com/advisories/GHSA-crqm-pwhx-j97f https://github.com/advisories/GHSA-j7vj-rw65-4v26 Fixes: https://github.com/scylladb/scylladb/issues/22884 regenerate frozen toolchain with optimized clang from * https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz * https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#22987	2025-02-24 13:49:36 +02:00
Avi Kivity	1891e10b7b	sstables: writer.hh: drop unneeded boost dependencies Closes scylladb/scylladb#22955	2025-02-24 13:26:44 +03:00
Avi Kivity	58d4d8142a	install-dependencies.sh: harden pip_packages against shellcheck pip_packages is an associative array, which in bash is constructed as ([key]=value...). In our case the value is often empty (indicating no version constraint). Shellcheck warns against it, since `[key]= x` could be a mistype of `[key]=x`. It's not in our case, but shellcheck doesn't know that. Make shellcheck happier by specifying the empty values explicitly. Closes scylladb/scylladb#22990	2025-02-24 13:26:10 +03:00
Kefu Chai	dfa40972bb	topology_custom/test_zero_token_nodes_multidc: Enhance test logging and error handling Add verbose logging to identify failing test combinations in multi-DC setup: - Log replication factor (RF) and consistency level (CL) for each test iteration - Add validation checks for empty result sets Improve error handling: - Before indexing in a list, use `assert` to check for its emptiness - Use assertion failures instead of exceptions for clearer test diagnostics This change helps debug test failures by showing which RF/CL combinations cause inconsistent results between zero-token and regular nodes. Refs scylladb/scylladb#22967 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22968	2025-02-24 11:09:51 +01:00
Kefu Chai	7bf7817e8a	docs/cql: s/wasm32-wasi/wasm32-wasip1/ Rust's WASI target of wasm32-wasi was renamed to wasm32-wasip1, see https://blog.rust-lang.org/2024/04/09/updates-to-rusts-wasi-targets.html. and our building system has been adapted to this change. let's update the document to reflect this change. Fixes scylladb/scylladb#20878 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21184	2025-02-24 11:06:46 +01:00
Patryk Jędrzejczak	de751cad03	Merge 'test/topology_experimental_raft: add test_topology_upgrade_stuck' from Piotr Dulikowski The test simulates the cluster getting stuck during upgrade to raft topology due to majority loss, and then verifies that it's possible to get out of the situation by performing recovery and redoing the upgrade. Fixes: #17410 Closes scylladb/scylladb#17675 * https://github.com/scylladb/scylladb: test/topology_experimental_raft: add test_topology_upgrade_stuck test.py: bump minimum python version to 3.11 test.py: move gather_safely to pylib utils cdc: generation: don't capture token metadata when retrying update test.py: topology: ignore hosts when waiting for group0 consistency raft: add error injection that drops append_entries topology_coordinator: add injection which makes upgrade get stuck	2025-02-24 11:02:32 +01:00
Kefu Chai	d92646a17e	install.sh: simplify check_usermode_support() because we don't care about the exact output of grep, let's silence its output. also, no need to check for the string is empty, so let's just use the status code of the grep for the return value of the function, more idiomatic this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22737	2025-02-24 11:29:30 +03:00
Evgeniy Naydanov	99be9ac8d8	test.py: test_random_failures: improve handling of hung node In some cases the paused/unpaused node can hang not after 30s timeout. This makes the test flaky. Change the condition to always check the coordinator's log if there is a hung node. Add `stop_after_streaming` to the list of error injections which can cause a node to hang. Also add a wait for a new coordinator election in cluster events which cause such elections. Closes scylladb/scylladb#22825	2025-02-24 10:23:05 +03:00
Kefu Chai	fd52b0a3cc	cql3: fix false-positive "used-after-move" warning in clang-tidy `slice.is_reversed()` was falsely flagged as accessing moved data, since the underlying enum_set remains valid after move. However, to improve code clarity and silence the warning, now reference `command->slice` directly instead, which is guaranteed to be valid as the move target. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22971	2025-02-23 18:58:35 +02:00
Marcin Maliszkiewicz	f34ea308b3	transport: remove unused _request_cpu from connection	2025-02-23 18:32:14 +02:00
Benny Halevy	7a4c563e40	feed_writers: optimize error path Eliminate one try/catch block around call to wr.close() by using coroutine::as_future. Mark error paths as `[[unlikely]]`. Use `coroutine::return_exception_ptr` to avoid rethrowing the final exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22831	2025-02-23 18:22:39 +02:00
Dawid Mędrek	138645f744	install-dependencies.sh: Make script capable of updating pip packages Before these changes, the script didn't update the listed pip packages if they were already installed. If the latest version of Scylla started using new features and required an updated Python driver, for example, the developers (and possibly the user) were forced to update it manually. In this commit, we modify the script so that it updates the installed packages when run. This should make things easier for everyone. Closes scylladb/scylladb#22912	2025-02-23 16:26:50 +02:00
Yaron Kaikov	084f4d2ee3	.github/scripts/auto-backport.py: search for `Fixes` also in commits In #22650 the backport process wasn't completed since the PR body didn't include the Fixes ref as expected but the commits did have it Expanding the search for `Fixes` to include commits in the same PR Fixes: https://github.com/scylladb/scylla-pkg/issues/4899 Closes scylladb/scylladb#22988	2025-02-23 13:20:28 +02:00
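Searching both the PR body and the commit messages for a `Fixes` reference could look roughly like the sketch below. It is an illustration only: the regular expression, the function name `find_fixed_issues`, and its parameters are assumptions made for this example and are not taken from `auto-backport.py`.

```python
import re

# Accepts "Fixes #123", "Fixes: org/repo#123" and "Fixes: https://github.com/org/repo/issues/123".
FIXES_RE = re.compile(
    r"Fixes:?\s*(?:https://github\.com/[\w.-]+/[\w.-]+/issues/|[\w.-]+/[\w.-]+#|#)(\d+)",
    re.IGNORECASE,
)

def find_fixed_issues(pr_body, commit_messages):
    """Collect issue numbers referenced by 'Fixes' in the PR body or in any commit."""
    texts = [pr_body or ""] + list(commit_messages)
    found = set()
    for text in texts:
        found.update(FIXES_RE.findall(text))
    return sorted(found, key=int)
```

With this shape, a PR whose body lacks the reference is still matched as long as one of its commits carries it.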
Pavel Emelyanov	a6c882e4e3	sstables: Remove dead get_config() and db::config declarations Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#22974	2025-02-21 15:56:04 +01:00
Tomasz Grabiec	62d53d2a47	tablet-mon.py: Don't show merges and splits as full table recreations Tablet sequence number was part of the tablet identifier together with last token, so on split and merge all ids changed and it appeared in the simulator as all tablets of a table dropping and being created anew. That's confusing. After this change, only last token is part of the id, so split appears as adding tablets and merge appears as removing half the tablets, which is more accurate.	2025-02-21 15:34:48 +01:00
Tomasz Grabiec	7227d70d4d	tablet-mon.py: Add toggle for tablet ids	2025-02-21 15:34:48 +01:00
Kefu Chai	a80d7e6159	test/pylib: test/pylib: Simplify boolean logic in pagination check Replace complex boolean expression: ```py not driver_response_future.has_more_pages or not all_pages ``` with clearer equivalent: ```py driver_response_future.has_more_pages and all_pages ``` The new expression is more intuitive as it directly checks for both conditions (having more pages and wanting all pages) rather than using double negation. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22969	2025-02-21 14:21:09 +03:00
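By De Morgan's law, `not a or not b` equals `not (a and b)`, so the two expressions quoted in this message are logical complements of each other. A rewrite like this preserves behaviour when the branches guarded by the condition are swapped as well; that is an assumption here, since the message does not show the surrounding code. The truth table can be checked exhaustively (the function names are illustrative only):

```python
from itertools import product

def old_condition(has_more_pages, all_pages):
    # Expression quoted as the original in the commit message.
    return not has_more_pages or not all_pages

def new_condition(has_more_pages, all_pages):
    # Expression quoted as the replacement in the commit message.
    return has_more_pages and all_pages

# For every combination of inputs, the old condition is the negation of the new one.
complements = all(
    old_condition(a, b) == (not new_condition(a, b))
    for a, b in product([False, True], repeat=2)
)
```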
Emil Maskovsky	574224491d	raft/test: adjust the "raft_ignore_nodes" test for limited voters Before the limited voters feature, the "raft_ignore_nodes" test was relying upon the fact that all nodes will become voters. With the limited voters feature, the test needs to be adjusted to ensure that we do not lose the majority of the cluster. This could happen when there are 7 nodes, but only 5 of them are voters - then if we kill 3 nodes randomly we might end up with only 2 voters left. Therefore we need to ensure that we only stop the appropriate number of voter nodes. So we need to determine which nodes became voters and which ones are non-voters, and select the nodes to be stopped based on that. That means with 7 nodes and 5 voters, we can stop up to 2 voter nodes, but at least one of the stopped nodes must be a non-voter. Fixes: scylladb/scylladb#22902 Refs: scylladb/scylladb#18793 Refs: scylladb/scylladb#21969 Closes scylladb/scylladb#22904	2025-02-20 18:42:03 +01:00
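The selection rule described above can be sketched as follows. This is a minimal illustration, not the test's code: `pick_nodes_to_stop` and its arguments are hypothetical, and the choice is deterministic here for clarity where the real test may pick nodes randomly.

```python
def pick_nodes_to_stop(voters, non_voters, count):
    """Choose `count` nodes to stop without losing the voter majority."""
    # A majority of voters must survive, so at most (n - 1) // 2 voters may go down.
    max_voters_down = (len(voters) - 1) // 2
    n_voters = min(count, max_voters_down)
    n_non_voters = count - n_voters
    if n_non_voters > len(non_voters):
        raise ValueError("cannot stop that many nodes without losing the majority")
    return voters[:n_voters] + non_voters[:n_non_voters]
```

With 5 voters and 2 non-voters, stopping 3 nodes takes at most 2 voters and at least 1 non-voter, leaving 3 of 5 voters alive.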
Patryk Jędrzejczak	6bb1ed2ef4	Merge 'Merge topology_tasks and topology_random_failures into topology_custom' from Artsiom Mishuta Now that we support suite subfolders, there is no need to create an own suite for topology_tasks and topology_random_failures. Closes scylladb/scylladb#22879 * https://github.com/scylladb/scylladb: test.py: merge topology_tasks suite into topology_custom suite test.py: merge topology_random_failures suite into topology_customs	2025-02-20 16:02:45 +01:00
Patryk Jędrzejczak	78c227c521	Merge 'raft topology: Add support for raft topology init to happen before group0 initialization' from Abhinav Kumar Jha In the current scenario, the problem discovered is that there is a time gap between group0 creation and raft_initialize_discovery_leader call. Because of that, the group0 snapshot/apply entry enters wrong values from the disk(null) and updates the in-memory variables to wrong values. During the above time gap, the in-memory variables have wrong values and perform absurd actions. This PR removes the variable `_manage_topology_change_kind_from_group0` which was used earlier as a work around for correctly handling `topology_change_kind` variable, it was brittle and had some bugs (causing issues like scylladb/scylladb#21114). The reason for this bug that _manage_topology_change_kind used to block reading from disk and was enabled after group0 initialization and starting raft server for the restart case. Similarly, it was hard to manage `topology_change_kind` using `_manage_topology_change_kind_from_group0` correctly in bug free manner. Post `_manage_topology_change_kind_from_group0` removal, careful management of `topology_change_kind` variable was needed for maintaining correct `topology_change_kind` in all scenarios. So this PR also performs a refactoring to populate all init data to system tables even before group0 creation(via `raft_initialize_discovery_leader` function). Now because `raft_initialize_discovery_leader` happens before the group 0 creation, we write mutations directly to system tables instead of a group 0 command. Hence, post group0 creation, the node can read the correct values from system tables and correct values are maintained throughout. Added a new function `initialize_done_topology_upgrade_state` which takes care of updating the correct upgrade state to system tables before starting group0 server. 
This ensures that the node can read the correct values from system tables and correct values are maintained throughout. By moving `raft_initialize_discovery_leader` logic to happen before starting group0 server, and not as group0 command post server start, we also get rid of the potential problem of init group0 command not being the 1st command on the server. Hence ensuring full integrity as expected by programmer. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#21114 Closes scylladb/scylladb#22484 * https://github.com/scylladb/scylladb: storage_service: Remove the variable _manage_topology_change_kind_from_group0 storage_service: fix indentation after the previous commit raft topology: Add support for raft topology system tables initialization to happen before group0 initialization service/raft: Refactor mutation writing helper functions.	2025-02-20 14:42:39 +01:00
Benny Halevy	29b795709b	token_group_based_splitting_mutation_writer: maybe_switch_to_new_writer: prevent double close Currently, maybe_switch_to_new_writer resets _current_writer only in a continuation after closing the current writer. This leaves a window of vulnerability if close() yields, and token_group_based_splitting_mutation_writer::close() is called. Seeing the engaged _current_writer, close() will call _current_writer->close() - which must be called exactly once. Solve this when switching to a new writer by resetting _current_writer before closing it and potentially yielding. Fixes #22715 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22922	2025-02-20 15:41:09 +03:00
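The fix pattern (detach the writer before awaiting its close) can be shown with a small asyncio analogue. This is a sketch of the general technique under assumed names (`Writer`, `Splitter`, `close_current`), not the Seastar code from the commit.

```python
import asyncio

class Writer:
    def __init__(self):
        self.close_calls = 0

    async def close(self):
        self.close_calls += 1
        await asyncio.sleep(0)  # yield, as a real close() might

class Splitter:
    def __init__(self):
        self._current = Writer()

    async def close_current(self):
        # Reset the member *before* awaiting, so a concurrent caller
        # sees no engaged writer and cannot close it a second time.
        writer, self._current = self._current, None
        if writer is not None:
            await writer.close()

async def demo():
    splitter = Splitter()
    writer = splitter._current
    # Two overlapping close paths, as in a writer switch racing with close().
    await asyncio.gather(splitter.close_current(), splitter.close_current())
    return writer.close_calls
```

Had `_current` been reset only after the `await`, the second caller would have found it still engaged during the yield and closed it again.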
Kefu Chai	ccbfe4f669	compaction: replace boost::range::find with std::ranges::find Replace boost::range::find() calls with std::ranges::find(). This change reduces external dependencies and modernizes the codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22942	2025-02-20 14:25:08 +02:00
Anna Stuchlik	a28bbc22bd	doc: remove references to Enterprise This commit removes the redundant references to Enterprise, which are no longer valid. Fixes https://github.com/scylladb/scylladb/issues/22927 Closes scylladb/scylladb#22930	2025-02-20 11:24:34 +02:00
Raphael S. Carvalho	4d8a333a7f	storage_service: Don't retry split when table is dropped The split monitor wasn't handling the scenario where the table being split is dropped. The monitor would be unable to find the tablet map of such a table, and the error would be treated as a retryable one, causing the monitor to fall into an endless retry loop, with sleeps in between. And that would block further splits, since the monitor would be busy with the retries. The fix detects that the table was dropped and skips to the next candidate, if any. Fixes #21859. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22933	2025-02-20 10:13:55 +01:00
Gleb Natapov	914c9f1711	treewide: include build_mode.hh for SCYLLA_BUILD_MODE_RELEASE where it is missing Fixes: #22914 Closes scylladb/scylladb#22915	2025-02-20 10:50:04 +03:00
Botond Dénes	1f553457dc	Merge 'test/topology: use standard new_test_keyspace functions' from Benny Halevy This PR improves and refactors the test.topology.util new_test_keyspace generator and adds a corresponding create_new_test_keyspace function to be used by most if not all topology unit tests in order to standardize the way the tests create keyspaces and to mitigate the python driver create keyspace retry issue: https://github.com/scylladb/python-driver/issues/317 Fixes #22342 Fixes #21905 Refs https://github.com/scylladb/scylla-enterprise/issues/5060 * No backport required, though may be desired to stabilize CI also in release branches. Closes scylladb/scylladb#22399 * github.com:scylladb/scylladb: test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace test/repair: create_table_insert_data_for_repair: create keyspace with unique name topology_tasks/test_tablet_tasks: use new_test_keyspace topology_tasks/test_node_ops_tasks: use new_test_keyspace topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace topology_custom/test_view_build_status: use new_test_keyspace topology_custom/test_truncate_with_tablets: use new_test_keyspace topology_custom/test_topology_failure_recovery: use new_test_keyspace topology_custom/test_tablets_removenode: use create_new_test_keyspace topology_custom/test_tablets_migration: use new_test_keyspace topology_custom/test_tablets_merge: use new_test_keyspace topology_custom/test_tablets_intranode: use new_test_keyspace topology_custom/test_tablets_cql: use new_test_keyspace topology_custom/test_tablets2: use *new_test_keyspace topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function topology_custom/test_tablets: use new_test_keyspace topology_custom/test_table_desc_read_barrier: use new_test_keyspace topology_custom/test_shutdown_hang: use new_test_keyspace 
topology_custom/test_select_from_mutation_fragments: use new_test_keyspace topology_custom/test_rpc_compression: use new_test_keyspace topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace topology_custom/test_raft_snapshot_truncation: use create_new_test_keyspace topology_custom/test_raft_no_quorum: use new_test_keyspace topology_custom/test_raft_fix_broken_snapshot: use new_test_keyspace topology_custom/test_query_rebounce: use new_test_keyspace topology_custom/test_not_enough_token_owners: use new_test_keyspace topology_custom/test_node_shutdown_waits_for_pending_requests: use new_test_keyspace topology_custom/test_node_isolation: use create_new_test_keyspace topology_custom/test_mv_topology_change: use new_test_keyspace topology_custom/test_mv_tablets_replace: use new_test_keyspace topology_custom/test_mv_tablets_empty_ip: use new_test_keyspace topology_custom/test_mv_tablets: use new_test_keyspace topology_custom/test_mv_read_concurrency: use new_test_keyspace topology_custom/test_mv_fail_building: use new_test_keyspace topology_custom/test_mv_delete_partitions: use new_test_keyspace topology_custom/test_mv_building: use new_test_keyspace topology_custom/test_mv_backlog: use new_test_keyspace topology_custom/test_mv_admission_control: use new_test_keyspace topology_custom/test_major_compaction: use new_test_keyspace topology_custom/test_maintenance_mode: use new_test_keyspace topology_custom/test_lwt_semaphore: use new_test_keyspace topology_custom/test_ip_mappings: use new_test_keyspace topology_custom/test_hints: use new_test_keyspace topology_custom/test_group0_schema_versioning: use new_test_keyspace topology_custom/test_data_resurrection_after_cleanup: use new_test_keyspace topology_custom/test_read_repair_with_conflicting_hash_keys: use new_test_keyspace topology_custom/test_read_repair: use new_test_keyspace topology_custom/test_compacting_reader_tombstone_gc_with_data_in_memtable: use new_test_keyspace 
topology_custom/test_commitlog_segment_data_resurrection: use new_test_keyspace topology_custom/test_change_replication_factor_1_to_0: use new_test_keyspace topology/test_tls: test_upgrade_to_ssl: use new_test_keyspace test/topology/util: new_test_keyspace: drop keyspace only on success test/topology/util: refactor new_test_keyspace test/topology/util: CREATE KEYSPACE IF NOT EXISTS test/topology/util: new_test_keyspace: accept ManagerClient	2025-02-20 09:43:15 +02:00
Kefu Chai	ddfd438434	cql3: replace boost::accumulate() with std::ranges::fold_left() Replace boost::accumulate() calls with std::ranges::fold_left(). This change reduces external dependencies and modernizes the codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22924	2025-02-20 09:32:17 +03:00
Kefu Chai	5be39740a8	tree: migrate from boost::find to std::ranges algorithms Replace boost::find() calls with std::ranges::find() and std::ranges::contains() to leverage modern C++ standard library features. This change reduces external dependencies and modernizes the codebase. The following changes were made: - Replaced boost::find() with std::ranges::find() where index/iterator is needed - Used std::ranges::contains() for simple element presence checks Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22920	2025-02-20 09:28:57 +03:00
Tomasz Grabiec	1a7023c85a	config, tablets: Allow tablets_initial_scale_factor to be a fraction We may want fewer than 1 tablets per shard in large clusters. The per-table option is a fraction, so for consistency, this should be too.	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	2b2fa0203e	test: tablets_test: Test scaling when creating lots of tables	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	0e111990a1	test: tablets_test: Test tablet count changes on per-table option and config changes	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	5e471c6f1b	test: tablets_test: Add support for auto-split mode rebalance_tablets() was performing migrations and merges automatically but not splits, because splits need to be acked by replicas via load_stats. It's inconvenient in tests which want to rebalance to the equilibrium point. This patch changes rebalance_tablets() to split automatically by default, can be disabled for tests which expect differently. shared_load_stats was introduced to provide a stable holder of load_stats which can be reused across rebalance_tablets() calls.	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	f3b63bfeff	test: cql_test_env: Expose db config	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	3d01ce3707	config: Make tablets_initial_scale_factor live-updateable	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	7e4a61953d	tablets: load_balancer: Pick initial_scale_factor from config So that it can be live-updated.	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	41789962ef	tablets, load_balancer: Fix and improve logging of resize decisions Resize is no longer only due to avg tablet size. Log avg tablet size as an information, not the reason, and log the true reason for target tablet count.	2025-02-19 16:29:07 +01:00
Tomasz Grabiec	d1ccbee7f9	tablets, load_balancer: Log reason for target tablet count Helps in debugging.	2025-02-19 16:29:07 +01:00
Tomasz Grabiec	029505b179	tablets: load_balancer: Move hints processing to tablet scheduler Hints have common meaning for all strategies, so the logic belongs more to make_sizing_plan(). As a side effect, we can reuse shard capacity computation across tables, which reduces computational complexity from O(tables * nodes) to O(tables * DCs + nodes)	2025-02-19 16:29:07 +01:00
Tomasz Grabiec	f1bda8d4c1	tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal The limit is enforced by controlling average per-shard tablet replica count in a given DC, which is controlled by per-table tablet count. This is effective in respecting the limit on individual shards as long as tablet replicas are distributed evenly between shards. There is no attempt to move tablets around in order to enforce limits on individual shards in case of imbalance between shards. If the average per-shard tablet count exceeds the limit, all tables which contribute to it (have replicas in the DC) are scaled down by the same factor. Due to rounding up to the nearest power of 2, we may overshoot the per-shard goal by at most a factor of 2. If different DCs want different scale factors of a given table, the lowest scale factor is chosen for a given table. The limit is configurable. It's a global per-cluster config which controls how many tablet replicas per shard in total we consider to be still ok. It controls tablet allocator behavior, when choosing initial tablet count. Even though it's a per-node config, we don't support different limits per node. All nodes must have the same value of that config. It's similar in that regard to other scheduler config items like tablets_initial_scale_factor and target_tablet_size_in_bytes.	2025-02-19 16:29:07 +01:00
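The scaling rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the load balancer's code: it covers a single DC, assumes every table has the same replication factor `rf`, and all names (`scale_down_for_goal`, `next_power_of_two`) are hypothetical.

```python
from math import ceil

def next_power_of_two(n):
    # Smallest power of two that is >= n (and at least 1).
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def scale_down_for_goal(tablet_counts, rf, shards_in_dc, per_shard_goal):
    """Scale all tables by one factor so the average replicas per shard meets the goal."""
    avg = sum(tablet_counts.values()) * rf / shards_in_dc
    if avg <= per_shard_goal:
        return dict(tablet_counts)  # already within the goal, nothing to do
    factor = per_shard_goal / avg
    # Rounding up to a power of two is what can overshoot the goal, by less than 2x.
    return {
        table: next_power_of_two(ceil(count * factor))
        for table, count in tablet_counts.items()
    }
```

For example, two tables with 64 and 32 tablets, `rf=3`, 12 shards and a goal of 10 start at an average of 24 replicas per shard; the sketch scales them to 32 and 16 tablets, for an average of 12, which is above the goal but within the factor-of-2 bound.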
Tomasz Grabiec	94b5165ac7	tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table This makes decisions made by the scheduler consistent with decisions made on table creation, with regard to tablet count. We want to avoid over-allocation of tablets when table is created, which would then be reduced by the scheduler's scaling logic. Not just to avoid wasteful migrations post table creation, but to respect the per-shard goal. To respect the per-shard goal, the algorithm will no longer be as simple as looking at hints, and we want to share the algorithm between the scheduler and initial tablet allocator. So invoke the scheduler to get the tablet count when table is created.	2025-02-19 14:40:07 +01:00
Tomasz Grabiec	dd68c1e526	tablets: load_balancer: Determine desired count from size separately from count from options For debugging purposes. Later we will want to know which rule determined the count.	2025-02-19 14:40:07 +01:00
Tomasz Grabiec	e4c5e2ab55	tablets: load_balancer: Determine resize decision from target tablet count The flow is simpler this way, since the decision cannot now be mismatched with target tablet count.	2025-02-19 14:40:07 +01:00
Tomasz Grabiec	35192e2d6f	tablets: load_balancer: Allow splits even if table stats not available This is in preparation for using the sizing plan during table creation where we never have size stats, and hints are the only determining factor for target tablet count.	2025-02-19 14:40:07 +01:00
Tomasz Grabiec	d3ffea77e6	tablets: load_balancer: Extract make_sizing_plan() Resize plan making will now happen in two stages: 1) Determine desired tablet counts per table (sizing plan) 2) Schedule resize decisions We need an intermediate step in the resize plan making, which gives us the planned tablet counts, so that we can plug this part of the algorithm into initial tablet allocation on table construction. We want decisions made by the scheduler to be consistent with decisions made on table creation. We want to avoid over-allocation of tablets when a table is created, which would then be reduced by the scheduler. Not just to avoid wasteful migrations post table creation, but to respect the per-shard goal. To respect the per-shard goal, the algorithm will no longer be as simple as looking at hints, and we want to share the algorithm between the scheduler and initial tablet allocator. Also, this sizing plan will be later plugged into a virtual table for observability.	2025-02-19 14:40:06 +01:00
Tomasz Grabiec	33db0d4fea	tablets: Add formatter for resize_decision::way_type	2025-02-19 14:39:40 +01:00
Tomasz Grabiec	b7e5919fdd	tablets: load_balancer: Simplify resize_urgency_cmp() Logic is preserved since target tablet size is constant for all tables. Dropping d.target_max_tablet_size() will allow us to move it to the load_balancer scope.	2025-02-19 14:39:40 +01:00
Tomasz Grabiec	997007a2df	tablets: load_balancer: Keep config items as instance members It fits preexisting pattern for other config items, and makes the code less cluttered because we don't have to carry config items across calls.	2025-02-19 14:39:39 +01:00
Tomasz Grabiec	ce959818a3	locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()	2025-02-19 14:38:50 +01:00
Tomasz Grabiec	f043c83ba5	tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard Currently the scale is applied post rounding up of tablet count so that tablet count per shard is at least 1. In order to be able to use the scale to increase tablet count per shard, we need to apply it prior to division by RF, otherwise we will overshoot per-shard tablet replica count. Example: 4 nodes, -c1, rf=3, initial_tablets_scale=10 Before: initial_tablet_count=20, tablet-per-shard=15 After: initial_tablet_count=14, tablets-per-shard=10.5	2025-02-19 14:38:50 +01:00
Tomasz Grabiec	2463e524ed	tablets: Set default initial tablet count scale to 10 This will result in new tables having at least 10 tablet replicas per shard by default. We want this to reduce tablet load imbalance due to differences in tablet count per shard, where some shards have 1 tablet and some shards have 2 tablets. With higher tablet count per shard, this difference-by-one is less relevant. Fixes #21967 In some tests, we explicitly set the initial scale to 1 as some of the existing tests assume 1 compaction group per shard. test.py uses a lower default. Having many tablets per shard slows down certain topology operations like decommission/replace/removenode, where the running time is proportional to tablet count, not data size, because constant cost (latency) of migration dominates. This latency is due to group0 operations and barriers. This is especially pronounced in debug mode. Scheduler allows at most 2 migrations per shard, so this latency becomes a determining factor for decommission speed. To avoid this problem in tests, we use lower default for tablet count per shard, 2 in debug/dev mode and 4 in release mode. Alternatively, we could compensate by allowing more concurrency when migrating small tablets, but there's no infrastructure for that yet. I observed that with 10 tablets per shard, debug-mode topology_custom.mv/test_mv_topology_change starts to time out during removenode (30 s).	2025-02-19 14:38:50 +01:00
Tomasz Grabiec	8eedb551b5	tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology() To insert preemption points later.	2025-02-19 14:38:49 +01:00
Tomasz Grabiec	eef18d879c	tablets: load_balancer: Extract get_schema_and_rs() For better readability.	2025-02-19 14:38:49 +01:00
Tomasz Grabiec	9d600dd783	tablets: load_balancer: Drop test_mode tablets_test is now creating proper schema in the database, so test_mode is no longer needed.	2025-02-19 14:38:48 +01:00
yangpeiyu2_yewu	0de232934a	mutation_writer/multishard_writer.cc: wrap writer into futurize_invoke Wrap the writer in seastar::futurize_invoke to make sure that close() for the mutation_reader can be executed before destruction. Fixes #22790 Closes scylladb/scylladb#22812	2025-02-19 13:00:45 +02:00
Pavel Emelyanov	d79eec2e76	sstable: Unfriend sstable_directory class It was only needed there for create_pending_deletion_log() method to get private "_storage" from sstable. Now it's all gone and friendship can be broken. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-19 13:09:04 +03:00
Pavel Emelyanov	96a867c869	sstable_directory: Move sstable_directory::pending_delete_result ... to where it belongs -- to the filesystem storage driver itself. Continuation of the previous patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-19 13:09:04 +03:00
Pavel Emelyanov	f6de6d6887	sstable_directory: Calculate prefixes outside of create_pending_deletion_log() The method in question walks the list of sstables and accumulates sstables' prefixes into a set on pending_delete_result object. The set in question is not used at all in this method and is in fact alien to it -- the p.d._result object is used by the filesystem storage driver as atomic deletion prepare/commit transparent context. That said, move the whole pending_delete_result to where it belongs and relax the create_pending_deletion_log() to only return the log directory path string. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-19 13:09:04 +03:00
Pavel Emelyanov	b0c1a77528	sstable_directory: Introduce local pending_delete_log variable This is simply to reduce the churn in the next patch, nothing special here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-19 13:09:04 +03:00
Pavel Emelyanov	5b92c4549e	sstable_directory: Relax toc file dumping to deletion log The current code takes sstable prefix() (e.g. the /foo/bar string), then trims from its front the basedir (e.g. the /foo/ string) and then writes the remainder, a slash and TOC component name (e.g. the xxx-TOC.txt string). The final result is "bar/xxx-TOC.txt" string. Taking into account that sstable.toc_filename() renders into sstable.prefix + \slash + component-name, the above result can be achieved by trimming basedir directory from toc_filename(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-19 13:09:04 +03:00
Botond Dénes	820f196a49	replica/database: setup_scylla_memory_diagnostics_producer() un-static semaphore dump lambda The lambda which dumps the diagnostics for each semaphore, is static. Considering that said lambda captures a local (writeln) by reference, this is wrong on two levels: * The writeln captured on the shard which happens to initialize this static, will be used on all shards. * The writeln captured on the first dump, will be used on later dumps, possibly triggering a segfault. Drop the `static` to make the lambda local and resolve this problem. Fixes: scylladb/scylladb#22756 Closes scylladb/scylladb#22776	2025-02-19 12:22:16 +03:00
Nadav Har'El	a7bf36831c	test: remove spammy deprecation warnings Recently, when running Alternator tests we get hundreds of warnings like the following from basically all test files: /usr/lib/python3.12/site-packages/botocore/crt/auth.py:59: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC). /usr/local/lib/python3.12/site-packages/pytest_elk_reporter.py:299: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC). These warnings all come from two libraries that we use in the tests - botocore is used by Alternator tests, and elk reporter is a plugin that we don't actually use, but it is installed by dtest and we often see it in our runs as well. These warnings have zero interest to us - not only do we not care if botocore uses some deprecated Python APIs and will need to be updated in the future, all these warnings are hiding real warnings about deprecated things we actually use in our own test code. The patch modifies test/pytest.ini (used by all our Python tests, including but not limited to Alternator tests) to ignore deprecation warnings from inside these two libraries, botocore and elk_reporter. After this patch, test/alternator/run finishes without any warnings at all. test/cqlpy does still have a few warnings left, which earlier were hidden by the thousands of spammy warnings eliminated in this patch. We fix one of these warnings in this patch: ResultSet indexing support will be removed in 4.0. Consider using ResultSet.one() by doing exactly what the warning recommended. Some deprecation warnings in test/cqlpy remain in calls to get_query_trace(). 
The "blame" for these warnings is misplaced - this function is part of the cassandra driver, but Python seems to think it's part of our test code so I can't avoid them with the pytest.ini trick, I'm not sure why. So I don't know yet how to eliminate these last warnings. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22881	2025-02-19 12:15:51 +03:00
Avi Kivity	45b2026209	service: raft: drop unused dependency from group0_state_machine_merger.hh Reduces dependency load. Closes scylladb/scylladb#22781	2025-02-19 12:14:58 +03:00
Kefu Chai	d1f117620a	build: restrict -Xclang options to Clang compiler only Modify CMake configuration to only apply "-Xclang" options when building with the Clang compiler. These options are Clang-specific and can cause errors or warnings when used with other compilers like g++. This change: - Adds compiler detection to conditionally apply Clang-specific flags - Prevents build failures when using non-Clang compilers Previously, the build system would apply these flags universally, which could lead to compilation errors with other compilers. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22899	2025-02-19 12:13:35 +03:00
Kefu Chai	d384b0a63e	utils: use std::to_underlying() when appropriate Use std::to_underlying() when comparing unsigned types with enumeration values to fix type mismatch warnings in GCC-14. This specifically addresses an issue in utils/advanced_rpc_compressor.hh where comparing a uint8_t with 0 triggered a '-Werror=type-limits' warning: ``` error: comparison is always false due to limited range of data type [-Werror=type-limits] if (x < 0 \|\| x >= static_cast<underlying>(type::COUNT)) ~~^~~ ``` Using std::to_underlying() provides clearer type semantics and avoids this kind of comparison warning. This change improves code readability while maintaining the same behavior. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22898	2025-02-19 12:12:28 +03:00
Benny Halevy	cc281ff88d	test_tablet_repair_scheduler: prepare_multi_dc_repair: use create_new_test_keyspace and return the keyspace unique name to the caller. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 09:35:33 +02:00
Aleksandra Martyniuk	f8e4198e72	service: tasks: hold token_metadata_ptr in tablet_virtual_task Hold token_metadata_ptr in tablet_virtual_task methods that iterate over tablets, to keep the tablet_map alive. Fixes: https://github.com/scylladb/scylladb/issues/22316. Closes scylladb/scylladb#22740	2025-02-19 09:33:53 +02:00
Dusan Malusev	4e6ea232d2	docs: add instruction for installing cassandra-stress Signed-off-by: Dusan Malusev <dusan.malusev@scylladb.com> Closes scylladb/scylladb#21723	2025-02-19 09:25:16 +02:00
Benny Halevy	cbe79b20f7	test/repair: create_table_insert_data_for_repair: create keyspace with unique name and return it to the caller Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:56:07 +02:00
Benny Halevy	9829b1594f	topology_tasks/test_tablet_tasks: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:52:59 +02:00
Benny Halevy	12f85ce57c	topology_tasks/test_node_ops_tasks: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:52:59 +02:00
Benny Halevy	0564e95c51	topology_custom/test_zero_token_nodes_no_replication: use create_new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:52:59 +02:00
Benny Halevy	46b1850f0c	topology_custom/test_zero_token_nodes_multidc: use create_new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:52:59 +02:00
Benny Halevy	b810791fbb	topology_custom/test_view_build_status: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:52:59 +02:00
Benny Halevy	2d4af01281	topology_custom/test_truncate_with_tablets: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:52:58 +02:00
Benny Halevy	16ef78075c	topology_custom/test_topology_failure_recovery: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	96d327fb83	topology_custom/test_tablets_removenode: use create_new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	f30e4c6917	topology_custom/test_tablets_migration: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	20f7eda16e	topology_custom/test_tablets_merge: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	5ff3153912	topology_custom/test_tablets_intranode: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	e59aca66bf	topology_custom/test_tablets_cql: use new_test_keyspace And create_new_test_keyspace when we need drop to be explicit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	6b37d04aa9	topology_custom/test_tablets2: use *new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	0b88ea9798	topology_custom/test_tablets2: test_schema_change_during_cleanup: drop unused check function Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	649e68c6db	topology_custom/test_tablets: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	005ceb77d3	topology_custom/test_table_desc_read_barrier: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	50a8f5c1c0	topology_custom/test_shutdown_hang: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	4fd6c2d24e	topology_custom/test_select_from_mutation_fragments: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	72bc4016e7	topology_custom/test_rpc_compression: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	47326d01b7	topology_custom/test_reversed_queries_during_simulated_upgrade_process: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	e72a9d3faa	topology_custom/test_raft_snapshot_truncation: use create_new_test_keyspace Using the new_test_keyspace fixture is awkward for this test as it is written to explicitly drop the created keyspaces at certain points. Therefore, just use create_new_test_keyspace to standardize the creation procedure. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	3f35491264	topology_custom/test_raft_no_quorum: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	380c5e5ac8	topology_custom/test_raft_fix_broken_snapshot: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	e05372afa4	topology_custom/test_query_rebounce: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	c68d2a471c	topology_custom/test_not_enough_token_owners: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	5759a97eb4	topology_custom/test_node_shutdown_waits_for_pending_requests: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	55b35eb21c	topology_custom/test_node_isolation: use create_new_test_keyspace new_test_keyspace is problematic here since the presence of the banned node can fail the automatic drop of the test keyspace due to NoHostAvailable (in debug mode for some reason) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	ff9c8428df	topology_custom/test_mv_topology_change: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	11005b10db	topology_custom/test_mv_tablets_replace: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	966cf82dae	topology_custom/test_mv_tablets_empty_ip: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	c05794c156	topology_custom/test_mv_tablets: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	d5e3c578f5	topology_custom/test_mv_read_concurrency: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	42a104038d	topology_custom/test_mv_fail_building: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	629ee3cb46	topology_custom/test_mv_delete_partitions: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	a82e734110	topology_custom/test_mv_building: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	b13e48b648	topology_custom/test_mv_backlog: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	ef85c4b27e	topology_custom/test_mv_admission_control: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	0e11aad9c5	topology_custom/test_major_compaction: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	0668c642a2	topology_custom/test_maintenance_mode: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	9c095b622b	topology_custom/test_lwt_semaphore: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	c6653e65ba	topology_custom/test_ip_mappings: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	fed078a38a	topology_custom/test_hints: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	480a5837ab	topology_custom/test_group0_schema_versioning: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	4fefffe335	topology_custom/test_data_resurrection_after_cleanup: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	57faab9ffa	topology_custom/test_read_repair_with_conflicting_hash_keys: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	205ed113dd	topology_custom/test_read_repair: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	fdb339bf28	topology_custom/test_compacting_reader_tombstone_gc_with_data_in_memtable: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	59687c25e0	topology_custom/test_commitlog_segment_data_resurrection: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	df84097a4b	topology_custom/test_change_replication_factor_1_to_0: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	a66ddb7c04	topology/test_tls: test_upgrade_to_ssl: use new_test_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	0fd1b846fe	test/topology/util: new_test_keyspace: drop keyspace only on success When the test fails with exception, keep the keyspace intact for post-mortem analysis. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	f946302369	test/topology/util: refactor new_test_keyspace Define create_new_test_keyspace that can be used in cases we cannot automatically drop the newly created keyspace due to e.g. loss of raft majority at the end of the test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	5d448f721e	test/topology/util: CREATE KEYSPACE IF NOT EXISTS Work around spurious keyspace creation errors due to retries caused by https://github.com/scylladb/python-driver/issues/317. This is safe since the function uses a unique_name for the keyspace so it should never exist by mistake. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:35 +02:00
Benny Halevy	50ce0aaf1c	test/topology/util: new_test_keyspace: accept ManagerClient Following patch will convert topology tests to use new_test_keyspace and friends. Some tests restart server and reset the driver connection so we cannot use the original cql Session for dropping the created keyspace in the `finally` block. Pass the ManagerClient instead to get a new cql session for dropping the keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-19 08:43:26 +02:00
Kefu Chai	727d5637ab	cql3: remove redundant std::move() in select_statement.cc GCC-14 correctly flagged unnecessary use of std::move() where copy elision applies: ``` return std::move(paging_state_copy); ``` This error occurs in indexed_table_select_statement::generate_view_paging_state_from_base_query_results at line 1122. The C++17 standard guarantees copy elision for returning local variables, making std::move() redundant in this context and potentially hindering compiler optimizations. Fixes build failure with GCC-14 which treats redundant moves as errors with -Werror=redundant-move. The error message looks like: ``` /usr/lib64/ccache/g++ -DDEVEL -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Dev\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/build/rust -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Dev/seastar/gen/include -isystem /home/kefu/dev/scylladb/abseil -I/usr/include/p11-kit-1 -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unused-parameter -Wno-changes-meaning -Wno-ignored-attributes -Wno-dangling-pointer -Wno-array-bounds -Wno-narrowing -Wno-type-limits -ffile-prefix-map=/home/kefu/dev/scylladb/= -ffile-prefix-map=/home/kefu/dev/scylladb/build=. 
-ffile-prefix-map=/home/kefu/dev/scylladb/build/=build -march=westmere -Wstack-usage=21504 -std=gnu++23 -Wno-maybe-uninitialized -Werror=unused-result -fstack-clash-protection -DSEASTAR_P2581R1 -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -MD -MT cql3/CMakeFiles/cql3.dir/Dev/statements/select_statement.cc.o -MF cql3/CMakeFiles/cql3.dir/Dev/statements/select_statement.cc.o.d -o cql3/CMakeFiles/cql3.dir/Dev/statements/select_statement.cc.o -c /home/kefu/dev/scylladb/cql3/statements/select_statement.cc /home/kefu/dev/scylladb/cql3/statements/select_statement.cc: In member function ‘seastar::lw_shared_ptr<const service::pager::paging_state> cql3::statements::indexed_table_select_statement::generate_view_paging_state_from_base_query_results(seastar::lw_shared_ptr<const service::pager::paging_state>, const seastar::foreign_ptr<seastar::lw_shared_ptr<query::result> >&, service::query_state&, const cql3::query_options&) const’: /home/kefu/dev/scylladb/cql3/statements/select_statement.cc:1122:21: error: redundant move in return statement [-Werror=redundant-move] 1122 \| return std::move(paging_state_copy); \| ~~~~~~~~~^~~~~~~~~~~~~~~~~~~ ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22903	2025-02-18 21:12:58 +02:00
Tomasz Grabiec	22386a6ceb	Merge 'truncate: don't fail on already waiting truncate for the same table' from Ferenc Szili Currently, we cannot have more than one global topology operation at the same time. This means that we cannot have concurrent truncate operations because truncate is implemented as a global topology operation. Truncate excludes with other topology operations, and has to wait for those to complete before truncate starts executing. This can lead to truncate timeouts. In these cases the client retries the truncate operation, which will check for ongoing global topology operations, and will fail with an "Another global topology request is ongoing, please retry." error. This can be avoided by truncate checking if the ongoing global topology operation is a truncate running for the same table whose truncate has just been requested again. In this case, we can wait for the ongoing truncate to complete instead of immediately failing the operation, and provide a better user experience. This is an improvement, backport is not needed. Closes #22166 Closes scylladb/scylladb#22371 * github.com:scylladb/scylladb: test: add test for re-cycling ongoing truncate operations truncate: add additional logging and improve error message during truncate storage_proxy: wait on already running truncate for the same table storage_proxy: allow multiple truncate table fibers per shard	2025-02-18 15:54:00 +01:00
Lakshmi Narayanan Sreethar	0f7d08d41d	topology_coordinator: handle_table_migration: do not continue after executing metadata barrier Return after executing the global metadata barrier to allow the topology handler to handle any transitions that might have been started by a concurrent transaction. Fixes #22792 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#22793	2025-02-18 15:48:45 +01:00
Botond Dénes	2e062e4e10	docs/operating-scylla: document scylla-sstable query	2025-02-18 07:37:05 -05:00
Botond Dénes	ddab1b939b	test/cqlpy/test_tools.py: add tests for scylla-sstable query	2025-02-18 07:37:05 -05:00
Artsiom Mishuta	3c3a23637a	test.py: merge topology_tasks suite into topology_custom suite Now that we support suite subfolders, there is no need to create a separate suite for tasks.	2025-02-18 13:15:31 +01:00
Artsiom Mishuta	dbdd0dd844	test.py: merge topology_random_failures suite into topology_customs Now that we support suite subfolders, there is no need to create a separate suite for random_failures.	2025-02-18 13:15:24 +01:00
Botond Dénes	3928851ab0	Merge 'encryption_at_rest_test/encryption: Add some verbosity etc to help diagnose test run issues' from Calle Wilund Refs #22628 Adds exception handler + cleanup for the case where we have a bad config/env vars (hint minio) or similar, such that we fail with exception during setting up the EAR context. In a normal startup, this is ok. We will report the exception, and then do an exit(1). In tests however, we don't, and the active context will instead be freed quite properly, in which case we need to call stop to ensure we don't crash on shared pointer destruction on the wrong shard. Doing so will hide the real issue from whoever runs the test. Adds some verbosity to track issues with the network proxy used to test EAR connector difficulties. Also adds an earlier close in input stream to help network usage. Note: This is a diagnostic helper. Still cannot repro the issue above. Closes scylladb/scylladb#22810 * github.com:scylladb/scylladb: gcp/aws kms: Promote service_error to recoverable + use malformed_response_error encryption_at_rest_test: Add verbosity + earlier stream close to proxy encryption: Add exception handler to context init (for tests)	2025-02-18 10:29:30 +02:00
Kefu Chai	9c5155fa63	compaction: switch from boost::accumulate to std::views::join Replace boost::accumulate() with the standard library's alternatives to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22856	2025-02-18 10:23:40 +02:00
Botond Dénes	aba4d07c62	tools/utils: configure_tool_mode: set auto_handle_sigint_sigterm = false Disable seastar's built-in handlers for SIGINT and SIGTERM and thus fall back to the OS's default handlers, which terminate the process. This makes tool applications interruptible by SIGINT and SIGTERM. The default handler just terminates the tool app immediately and doesn't allow for cleanup, but this is fine: the tools have no important data to save or any critical cleanup to do before exiting. Fixes: scylladb/scylladb#16954 Closes scylladb/scylladb#22838	2025-02-17 23:28:18 +02:00
Avi Kivity	30a38e61d4	Merge 'sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update' from Lakshmi Narayanan Sreethar The config variable `components_memory_reclaim_threshold` limits the memory available to the sstable bloom filters. Any change to its value is not immediately propagated to the sstable manager, despite it being a LiveUpdate variable. The updated value takes effect only when a new sstable is created or deleted. This PR first refactors the reclaim and reload logic into a single background fiber. It then updates the sstable manager to subscribe to changes in the `components_memory_reclaim_threshold` configuration value and immediately triggers the reclaim/reload fiber when a change is detected. Fixes #21947 This is an improvement and does not need to be backported. Closes scylladb/scylladb#22725 * github.com:scylladb/scylladb: sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update sstables_manager: maybe_reclaim_components: yield between iterations sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()` sstables_manager: move reclaim logic into `components_reclaim_reload_fiber()` sstables_manager: rename `_sstable_deleted_event` condition variable sstables_manager: rename `components_reloader_fiber()` sstables_manager: fix `maybe_reclaim_components()` indentation sstables_manager: reclaim components memory until usage falls below threshold sstables_manager: introduce `get_components_memory_reclaim_threshold()` sstables_manager: extract `maybe_reclaim_components()` sstables_manager: fix `maybe_reload_components()` indentation sstables_manager: extract out `maybe_reload_components()`	2025-02-17 22:33:33 +02:00
Lakshmi Narayanan Sreethar	064bf2fd85	sstables_manager: trigger reclaim/reload on `components_memory_reclaim_threshold` update The config variable `components_memory_reclaim_threshold` limits the memory available to the sstable bloom filters. Any change to its value is not immediately propagated to the sstable manager, despite it being a LiveUpdate variable. The updated value takes effect only when a new sstable is created or deleted. This patch updates the sstable manager to subscribe to any changes in the above mentioned config value and immediately trigger the reclaim/reload fiber when a change occurs. Also, adds a testcase to verify the fix. Fixes #21947 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-17 20:55:45 +05:30
Calle Wilund	00263aa57a	gcp/aws kms: Promote service_error to recoverable + use malformed_response_error Refs #22628 Mark problems parsing response (partial message, network error without exception etc - hello testing), as "malformed_response_error", and promote this as well as general "service_error" to recoverable exceptions (don't isolate node on error). This to better handle intermittent network issues as well as making error-testing more deterministic.	2025-02-17 13:49:43 +00:00
Calle Wilund	5905c19ab4	encryption_at_rest_test: Add verbosity + earlier stream close to proxy Refs #22628 Adds some verbosity to track issues with the network proxy used to test EAR connector difficulties. Also adds an earlier close in input stream to help network usage. Note: This is a diagnostic helper. Still cannot repro the issue above.	2025-02-17 13:49:43 +00:00
Calle Wilund	83aa66da1a	encryption: Add exception handler to context init (for tests) Adds an exception handler + cleanup for the case where we have a bad config/env vars (hint: minio) or similar, such that we fail with an exception while setting up the EAR context. In a normal startup, this is ok: we report the exception and then do an exit(1). In tests, however, we don't exit, and the active context is instead freed properly, in which case we need to call stop to ensure we don't crash on shared pointer destruction on the wrong shard. Such a crash would hide the real issue from whoever runs the test.	2025-02-17 13:49:42 +00:00
Piotr Dulikowski	35df6bb6b2	Merge 'raft_rpc::send_append_entries: limit memory usage' from Petr Gusev Serializing `raft::append_request` for transmission requires approximately the same amount of memory as its size. This means when the Raft library replicates a log item to M servers, the log item is effectively copied M times. To prevent excessive memory usage and potential out-of-memory issues, we limit the total memory consumption of in-flight `raft::append_request` messages. Fixes scylladb/scylladb#14411 Closes scylladb/scylladb#22835 * github.com:scylladb/scylladb: raft_rpc::send_append_entries: limit memory usage fms: extract entry_size to log_entry::get_size	2025-02-17 14:11:12 +01:00
Botond Dénes	a32b4d20cf	test/cqlpy/test_tools.py: make scylla_sstable() return table name also Not used by current users, will be needed by next patch.	2025-02-17 08:01:39 -05:00
Botond Dénes	5d09182ce5	scylla-sstable: introduce the query command Allows querying the content of sstables. Simple queries can be constructed on the command-line. More advanced queries can be passed in a file. The output can be text (similar to CQLSH) or json (similar to SELECT JSON). Uses a cql_test_env behind the scenes to set-up a query pipeline. The queried sstables are not registered into cql_test_env, instead they are queried via the virtual-table interface. This is to isolate the sstables from any accidental modifications cql_test_env might want to do to them.	2025-02-17 08:01:39 -05:00
Botond Dénes	5e76dd90a9	tools/utils: get_selected_operation(): use std::string for operation_options tool_app_template::run() calls get_selected_operation() to obtain the operation (command) the user selected. To do this, get_selected_operation() does a CLI pre-parsing pass, with a minimal boost::program_options, so things like mixed positional/non-positional args are correctly handled. This code uses `sstring` for generic operation-options. The problem is that boost doesn't allow values with spaces inside for non-std::string types. This therefore prevents such values from being used for any option downstream, because parsing would fail at this stage. Change the type to std::string to solve this problem.	2025-02-17 08:01:39 -05:00
Botond Dénes	a6caade11d	utils/rjson: streaming_writer: add RawValue() Exposes the RawValue() method of the underlying rapidjson::Writer. This method allows writing a pre-formatted json value to the stream. This will allow using cql3/type_json.hh to pre-format CQL3 types, then write these pre-formatted values into a json stream.	2025-02-17 08:01:38 -05:00
Botond Dénes	c917ee0638	cql3/type_json: add to_json_type() Translate a CQL value of a CQL type into the appropriate rjson::type.	2025-02-17 08:01:38 -05:00
Botond Dénes	01a4d30d88	test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread() This variant of do_with_cql_env() forgoes the reentrancy support in the regular do_with_cql_env() variants, and re-uses the caller's existing seastar thread. This is an optimized version for callers which don't need reentrancy and already have a thread.	2025-02-17 08:01:38 -05:00
Piotr Dulikowski	da2237417c	test/topology_experimental_raft: add test_topology_upgrade_stuck The test simulates the cluster getting stuck during upgrade to raft topology due to majority loss, and then verifies that it's possible to get out of the situation by performing recovery and redoing the upgrade. Fixes: scylladb/scylladb#17410	2025-02-17 13:12:53 +01:00
Piotr Dulikowski	a2f5e6ab0a	test.py: bump minimum python version to 3.11 Python 3.11 introduces asyncio.TaskGroup, which I would like to use in a test that I'll introduce in the next commit. Modify the python version check in test.py to prevent from accidentally running with an older version of python.	2025-02-17 13:12:49 +01:00
Piotr Dulikowski	1b6fb95efc	test.py: move gather_safely to pylib utils The gather_safely function was originally defined in the test.pylib.scylla_cluster module, but it is a generic concurrency combinator which is not tied to the concept of Scylla clusters at all. Move it to test.pylib.util to make this fact more clear.	2025-02-17 12:47:13 +01:00
Piotr Dulikowski	56ae119b19	cdc: generation: don't capture token metadata when retrying update In legacy topology mode, on startup, a node will attempt to insert data of the newest CDC generation into the legacy distributed tables. In case of any errors, the operation will be retried until success in 60s intervals. While the node waits for the operation to be retried, it keeps a token_metadata_ptr instance. This is a problem for two reasons: - The tmptr instance is used in a lambda which determines the cluster size. This lambda is used to determine the consistency level when inserting the generation to the distributed tables - if there is only one node, CL=ONE should be used instead of CL=QUORUM. The tmptr is immutable so it can technically happen that the cluster is shrunk while the code waits for the generation to be inserted. - Token metadata instance keeps a version tracker which prevents topology operations from proceeding while the tracker exists. This is a very niche problem, but it might happen that a leftover instance of token metadata held by update_streams_description might delay a topology operation which happens after upgrade to raft topology happens. This actually slows down the test which simulates upgrade to raft topology getting stuck (to be introduced in later commits). Instead of capturing a token_metadata_ptr instance, capture a reference to shared_token_metadata and use a freshly issued token_metadata_ptr when computing the cluster size in order to choose the consistency level.	2025-02-17 12:28:53 +01:00
Piotr Dulikowski	d75888460d	test.py: topology: ignore hosts when waiting for group0 consistency Now, check_system_topology_and_cdc_generations_v3_consistency has an additional list argument and will ignore hosts from that list if some of them are found to be in the "left" state. Additionally, the function now requires that the set of the live hosts in the cluster is exactly `live_hosts` - no more, no less. It will be needed for the test which simulates upgrade procedure getting stuck - "un-stucking" the procedure requires removing some nodes via legacy removenode procedure which marks them as "left" in gossip, and then those nodes might get inserted as "left" nodes into raft topology by the gossiper orphan remover fiber. Some of the existing tests had to be adjusted because of the changes: - test_unpublished_cdc_generations_arent_cleared passed only one of the cluster's live hosts, now it passes all of them. - test_topology_recovery_after_majority_loss removes some nodes during the test, so they need to be put into the ignore_nodes list. - test_topology_upgrade_basic did not include the last-added node to the check_system_topology_and_cdc_generations_v3_consistency call, now it does.	2025-02-17 12:28:52 +01:00
Piotr Dulikowski	f112d76422	raft: add error injection that drops append_entries It will be needed for a test that simulates the cluster getting stuck during upgrade. Specifically, it will be used to simulate network isolation and to prevent raft commands from reaching that node.	2025-02-17 12:28:52 +01:00
Piotr Dulikowski	cd1a336885	topology_coordinator: add injection which makes upgrade get stuck The injection will be necessary for the test, introduced in the next commit, which verifies that it's possible to recover from an upgrade of raft topology which gets stuck.	2025-02-17 12:28:52 +01:00
Kefu Chai	3cf0f71420	query-result-writer: reorder initialization to prevent use-after-move Reorder member variable initialization sequence to ensure `pw` is accessed before being moved. While the current use-after-move warning from clang-tidy is a false positive, this change: - Makes the initialization order more logical - Eliminates misleading static analysis warnings - Prevents potential future issues if class structure changes Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22830	2025-02-17 13:45:35 +03:00
Abhi	d7884cf651	storage_service: Remove the variable _manage_topology_change_kind_from_group0 This commit removes the variable _manage_topology_change_kind_from_group0 which was used earlier as a work around for correctly handling topology_change_kind variable, it was brittle and had some bugs. Earlier commits made some modifications to deal with handling topology_change_kind variable post _manage_topology_change_kind_from_group0 removal	2025-02-17 15:19:39 +05:30
Abhi	623e01344b	storage_service: fix indentation after the previous commit	2025-02-17 15:06:27 +05:30
Nadav Har'El	5693c18637	test/cqlpy, alternator: allow downloading 2025 releases This patch adds to the fetch_scylla.py script, used by the "--release" option of test/{cqlpy,alternator}/run, the ability to download the new 2025.1 releases. In the new single-stream releases, the number looks like the old Scylla Enterprise releases, but the location of the artifacts in the S3 bucket look like the old open-source releases (without the word "-enterprise" in the paths). So this patch introduces a new "if" for the (major >= 2025) case. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22778	2025-02-17 12:30:42 +03:00
Ferenc Szili	8f8c5c5e24	test: add test for re-cycling ongoing truncate operations This change adds a test for truncate waiting for already queued truncate operation for the same table.	2025-02-17 10:18:29 +01:00
Ferenc Szili	af3fb1941a	truncate: add additional logging and improve error message during truncate This change adds two log messages. One for the creation of the truncate global topology request, and another for the truncate timeout. This is added in order to help with tracking truncate operation events. It also extends the "Another global topology request is ongoing, please retry." error message with more information: keyspace and table name.	2025-02-17 10:18:29 +01:00
Ferenc Szili	e87768c5a0	storage_proxy: wait on already running truncate for the same table Currently, we cannot have more than one global topology operation at the same time. This means that we cannot have concurrent truncate operations because truncate is implemented as a global topology operation. Truncate excludes with other topology operations, and has to wait for those to complete before truncate starts executing. This can lead to truncate timeouts. In these cases the client retries the truncate operation, which will check for ongoing global topology operations, and will fail with an "Another global topology request is ongoing, please retry." error. This can be avoided by truncate checking if we have a truncate for the same table already queued. In this case, we can wait for the ongoing truncate to complete instead of immediately failing the operation, and provide a better user experience.	2025-02-17 10:18:20 +01:00
Piotr Dulikowski	e4d574fdbb	Merge 'Fix view-builder vs (repair and streaming) initialization order' from Pavel Emelyanov Both, repair and streaming depend on view builder, but since the builder is started too late, both keep sharded<> reference on it and apply `if (view_builder.local_is_initialized())` safety checks. However, view builder can do its sharded start much earlier, there's currently nothing that prevents it from doing so. This PR moves view builder start up together with some other of its dependencies, and relaxes the way repair and streaming use their view-builder references, in particular -- removes those ugly initialization checks. refs: scylladb/scylladb#2737 Closes scylladb/scylladb#22676 * github.com:scylladb/scylladb: streaming: Relax streaming::make_streamig_consumer() view builder arg streaming: Keep non-sharded view_builder dependency reference streaming: Remove view_builder.local_is_initialized() checks repair: Keep non-sharded view_builder dependency reference repair: Remove view_builder.local_is_initialized() checks main: Start sharded<view_builder> earlier test/cql_env: Move stream manager start lower	2025-02-17 10:03:28 +01:00
Kefu Chai	2ed465e70a	install.sh: address shellcheck warnings Replace legacy shell test operator (-o) with more portable OR (\|\|) syntax. Fix fragile file handling in find loop by using while read loop instead. Warnings fixed: - SC2166: Replace [ p -o q ] with [ p ] \|\| [ q ] - SC2044: Replace for loop over find with while read loop While no issues were observed with the current code, these changes improve robustness and portability across different shell environments. also, set the pipefail option, so that we can catch the unexpected failure of `find` command call. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22385	2025-02-17 12:01:51 +03:00
Pavel Emelyanov	ac989f7c30	api: Remove get_uuid() local helper This helper now fully duplicates the validate_table() one, so it can be removed. Two callers are updated respectively. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-17 11:42:33 +03:00
Pavel Emelyanov	a4cbc4db55	api: Make use of validate_table()'s table_id There are several places that validate_table() and then call database::find_column_family(ks, cf) which goes and repeats the search done by validate_table() before that. To remove the unneeded work, re-use the table_id found by validate_table() helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-17 11:42:33 +03:00
Pavel Emelyanov	e698259557	api: Make validate_table() helper return table_id after validation This helper calls database::find_column_family() and ignores the result. The intention of this is just to check if the c.f. in question exists. The find_column_family() in turn calls find_uuid() and then finds the c.f. object using the uuid found. The latter search is not supposed to fail, if it does, the on_internal_error() is called. Said that, replacing find_column_family() with find_uuid() is idempotent. And returning the found table_id will be used by next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-17 11:42:32 +03:00
Pavel Emelyanov	1991512826	api: Change validate_table()'s ctx argument to database This is to be in-sync with another get_uuid() helper from API. This, in turn, is to ease the unification of those two, because they are effectively identical (see next patches) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-17 11:42:32 +03:00
Botond Dénes	b87f5a0b58	reader_concurrency_semaphore: remove redundant inactive_read::ttl_timer It is redundant with reader_permit::impl::_ttl_timer. Use the latter for TTL of inactive reads too. The usage of the two exclude each other, at any point in time, either one or the other is used, so no reason to keep both. Closes scylladb/scylladb#22863	2025-02-17 11:41:16 +03:00
Botond Dénes	15126e4c9f	reader_concurrency_semaphore: use std::ranges::for_each() Instead of boost::for_each(). Closes scylladb/scylladb#22862	2025-02-17 11:35:32 +03:00
Avi Kivity	b7f804659b	clustering_range_walker: drop boost iterator_range dependency Reduces dependency load. Closes scylladb/scylladb#22880	2025-02-17 11:34:46 +03:00
Avi Kivity	03ae67f9ea	tablets: load_balancer: don't log decisions to do nothing Demote do-nothing decisions to debug level, but keep them at info if we did decide to do something (such as migrate a tablet). Information about more major events (like split/merge) is kept at info level. One log line that logs node information now also logs the datacenter, which was previously supplied by a log line that is now debug-only. Closes scylladb/scylladb#22783	2025-02-17 11:34:27 +03:00
Botond Dénes	3439d015cb	Merge 'repair: Introduce Host and DC filter support' from Aleksandra Martyniuk Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support. Fixes https://github.com/scylladb/scylladb/issues/22417 New feature. No backport is needed. Closes scylladb/scylladb#22621 * github.com:scylladb/scylladb: test: add test to check dcs and hosts repair filter test: add repair dc selection to test_tablet_metadata_persistence repair: Introduce Host and DC filter support docs: locator: update the docs and formatter of tablet_task_info	2025-02-17 10:04:09 +02:00
Kefu Chai	aa8c27b872	db: prevent accidental copies of result_set_row by making it move-only result_set_row is a heavyweight object containing multiple cell types: regular columns, partition keys, and static values. To prevent expensive accidental copies, delete the copy constructor and replace it with: 1. A move constructor for efficient vector reallocation 2. An explicit copy() method when copies are actually needed This change reduces overhead in some non-hot paths by eliminating implicit deep copies. Please note, previously, in `create_view_from_mutation()`, we kept a copy of `result_set_row`, and then reused `table_rs` for holding the mutation for `scylla_tables`. Because we don't copy the `result_set_row` in this change, in order to avoid invalidating the `row` after reusing `table_rs` in the outer scope, we define a new `table_rs` shadowing the one in the outer scope. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22741	2025-02-17 09:48:08 +02:00
Botond Dénes	57a06a4c35	Merge 'Enhance s3 client perf test with "uploading" facility and related tunables' from Pavel Emelyanov The existing test measures latencies of object GET-s. That's nice (though incomplete), but we want to measure upload performance. Here it is. refs: #22460 Closes scylladb/scylladb#22480 * github.com:scylladb/scylladb: test/perf/s3: Add --part-size-mb option for upload test test/perf/s3: Add uploading test test/perf/s3: Some renames not to be download-centric test/perf/s3: Make object/file name configurable test/perf/s3: Configure maximum number of sockets test/perf/s3: Remove parallelizm s3/client: Make http client connections limit configurable	2025-02-17 09:46:11 +02:00
Avi Kivity	81821d26cd	cql3: functions: add set_intersection() Given two sets of equivalent types, return the set intersection. This is a generic function which adapts to the actual input type. A unit test is added. Closes scylladb/scylladb#22763	2025-02-16 14:06:29 +02:00
Nadav Har'El	4a2654865d	Merge 'test.py: support subfolders' from Artsiom Mishuta This PR is a proper (pythonic) change of commit `288a47f815`. Creating a separate folder used to be needed for two reasons: (1) we want a separate test suite, with its own settings; (2) we want to structure tests, e.g. tablets, raft, schema, gossip. We've been creating many folders recently. However, test suite infrastructure is expensive in test.py - each suite has its own pool of servers, concurrency settings and so on. Make it possible to structure tests without too many suites, by supporting subfolders within a suite. As an example, this PR moves mv tests into a separate folder. Custom test.py lookup also works. Tests can be run as: 1. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets/test_mv_tablets_empty_ip 2. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets 3. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv Fixes https://github.com/scylladb/scylladb/issues/20570 Closes scylladb/scylladb#22816 * github.com:scylladb/scylladb: test.py: move mv tests into a separate folder test.py: support subfolders	2025-02-16 12:36:25 +02:00
Andrei Chekun	17992c0456	Remove tox Seems tox is not used anywhere, so there is no need to have it then. Especially when it messes with pytest. In some cases it can change the config dir in pytest run. Closes scylladb/scylladb#22819	2025-02-16 12:23:55 +02:00
Kefu Chai	34517b09a2	alternator,streaming: fix comment typos Fix misspellings in comments identified by the codespell tool. fix typos in comment Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22829	2025-02-16 11:34:44 +02:00
Piotr Szymaniak	c1f186c98a	alternator: re-enabling/changing existing stream's StreamViewType as well as disabling the nonexistent stream Table updates that try to enable stream (while changing or not the StreamViewType) on a table that already has the stream enabled will result in ValidationError. Table updates that try to disable stream on a table that does not have the stream enabled will result in ValidationError. Add two tests to verify the above. Mark the test for changing the existing stream's StreamViewType not to xfail. Fixes scylladb/scylladb#6939 Closes scylladb/scylladb#22827	2025-02-16 09:57:49 +02:00
Jenkins Promoter	0d5f5e6c9d	Update pgo profiles - x86_64	2025-02-15 20:32:23 +02:00
Jenkins Promoter	9daf50d424	Update pgo profiles - aarch64	2025-02-15 20:32:22 +02:00
Lakshmi Narayanan Sreethar	a145a2f83a	scylla-gdb: scylla_read_stats: access schema via schema_ptr class Switch to using schema_ptr wrapper when handling schema references in scylla_read_stats function. The existing fallback for older versions (where schema is already a raw pointer) remains preserved. Fixes #18700 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#22726	2025-02-15 20:32:22 +02:00
Calle Wilund	342df0b1a8	network_topology_strategy/alter ks: Remove dc:s from options once rf=0 Fixes #22688 If we set a dc rf to zero, the options map will still retain a dc=0 entry. If this dc is decommissioned, any further alters of keyspace will fail, because the union of new/old options will now contain an unknown keyword. Change alter ks options processing to simply remove any dc with rf=0 on alter, and treat this as an implicit dc=0 in nw-topo strategy. This means we change the reallocate_tablets routine to not rely on the strategy objects dc mapping, but the full replica topology info for dc:s to consider for reallocation. Since we verify the input on attribute processing, the amount of rf/tablets moved should still be legal. v2: * Update docs as well. v3: * Simplify dc processing * Reintroduce options empty check, but do early in ks_prop_defs * Clean up unit test some Closes scylladb/scylladb#22693	2025-02-15 20:32:22 +02:00
Nadav Har'El	f89235517d	test/topology_custom: fix very slow test test_localnodes_broadcast_rpc_address The test topology_custom/test_alternator::test_localnodes_broadcast_rpc_address sets up nodes with a silly "broadcast rpc address" and checks that Alternator's "/localnodes" request returns it correctly. The problem is that although we don't use CQL in this test, the test framework does open a CQL connection when the test starts, and closes it when it ends. It turns out that when we set a silly "broadcast RPC address", the driver tends to try to connect to it when shutting down, I'm not even sure why. But the choice of 1.2.3.4 as the silly address is unfortunate, because this IP address is actually routable - and the driver hangs until it times out (in practice, in a bit over two minutes). This trivial patch changes 1.2.3.4 to 127.0.0.0 - an equally silly address but one to which connections fail immediately. Before this patch, the test often takes more than 2 minutes to finish on my laptop, after this patch, it always finishes in 4-5 seconds. Fixes #22744 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22746	2025-02-15 20:32:22 +02:00
Botond Dénes	87e8e00de6	tools/scylla-nodetool: netstats: don't assume both senders and receivers The code currently assumes that a session has both sender and receiver streams, but it is possible to have just one or the other. Change the test to include this scenario and remove this assumption from the code. Fixes: #22770 Closes scylladb/scylladb#22771	2025-02-15 20:32:22 +02:00
Pavel Emelyanov	1b44861e8f	Merge 'sstable_loader: fix cross-shard resource cleanup in download_task_impl ' from Kefu Chai This PR addresses two related issues in our task system: 1. Prepares for asynchronous resource cleanup by converting release_resources() to a coroutine. This refactoring enables future improvements in how we handle resource cleanup. 2. Fixes a cross-shard resource cleanup issue in the SSTable loader where destruction of per-shard progress elements could trigger "shared_ptr accessed on non-owner cpu" errors in multi-shard environments. The fix uses coroutines to ensure resources are released on their owner shards. Fixes #22759 --- this change addresses a regression introduced by `d815d7013c`, which is contained by 2025.1 and master branches. so it should be backported to 2025.1 branch. Closes scylladb/scylladb#22791 * github.com:scylladb/scylladb: sstable_loader: fix cross-shard resource cleanup in download_task_impl tasks: make release_resources() a coroutine	2025-02-15 20:32:22 +02:00
Kefu Chai	7ff0d7ba98	tree: Remove unused boost headers This commit eliminates unused boost header includes from the tree. Removing these unnecessary includes reduces dependencies on the external Boost.Adapters library, leading to faster compile times and a slightly cleaner codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22857	2025-02-15 20:32:22 +02:00
Raphael S. Carvalho	d78f57e94a	service: Don't use new tablet_resize_finalization state until supported In a rolling upgrade, nodes that weren't upgraded yet will not recognize the new tablet_resize_finalization state, that serves both split and merges, leading to a crash. To fix that, coordinator will pick the old tablet_split_finalization state for serving split finalization, until the cluster agrees on merge, so it can start using the new generic state for resize finalization introduced in merge series. Regression was introduced in `e00798f`. Fixes #22840. Reported-by: Tomasz Grabiec <tgrabiec@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22845	2025-02-15 20:32:22 +02:00
Li Bo	de8de50fb9	Remove redundant code in mutation_partition.cc Use the defined `cdef` variable. Closes scylladb/scylladb#22048	2025-02-15 20:32:22 +02:00
Nadav Har'El	26fa234f87	test/cqlpy,alternator: "--release" should not require AWS credentials The script fetch_scylla.py is used by the "--release" option of test/cqlpy/run and test/alternator/run to fetch a given release of Scylla. The release is fetched from S3, and the script assumed that the user properly set up $HOME/.aws/config and $HOME/.aws/credentials to determine the source of that download and the credentials to do this. But this is unnecessary - Scylla's "downloads.scylladb.com" bucket actually allows anonymous downloads, and this is what we should use. After this patch, fetch_scylla.py (and the "--release" option of the run scripts) work correctly even for a user that doesn't have $HOME/.aws set up at all. This fix is especially important to new developers, who might not even have AWS credentials to put into these files. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22773	2025-02-15 20:32:22 +02:00
Pavel Emelyanov	2970567b3a	streaming: Relax streaming::make_streamig_consumer() view builder arg Two callers of it -- repair and stream-manager -- both have non-sharded reference and can just use it as argument. The helper in question gets sharded<> one by itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:56 +03:00
Pavel Emelyanov	1140a875e1	streaming: Keep non-sharded view_builder dependency reference Continuation of the previous patch -- view builder is started early enough and construction of stream manager can happen with non-sharded reference on it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:56 +03:00
Pavel Emelyanov	3cb9758bd1	streaming: Remove view_builder.local_is_initialized() checks Now stream_manager starts with sharded<view_builder> started and this check can be dropped. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:56 +03:00
Pavel Emelyanov	7bd3d31ac6	repair: Keep non-sharded view_builder dependency reference Continuation of the previous patch -- view builder is started early enough and construction of repair service can happen with non-sharded reference on it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:56 +03:00
Pavel Emelyanov	423abc918c	repair: Remove view_builder.local_is_initialized() checks Now repair service starts with sharded<view_builder> started and those checks can be dropped. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:55 +03:00
Pavel Emelyanov	5d1f74b86a	main: Start sharded<view_builder> earlier The view_builder service is needed by repair service, but is started after it. It's OK in a sense that repair service holds a sharded reference on it and checks whether local_is_initialized() before using it, which is not nice. Fortunately, starting sharded view builder can be done early enough, because most of its dependencies would be already started by that time. Two exceptions are -- view_update_generator and system_distributed_keyspace. Both can be moved up too with the same justification. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:55 +03:00
Pavel Emelyanov	f650e75137	test/cql_env: Move stream manager start lower This is to keep it in-sync with main code, where stream manager is started after storage_proxy's and query_processor's remotes. This doesn't change anything for now, but next patches will move other services around main/cql_test_env and early start of stream manager in cql_test_env will be problematic. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:25:20 +03:00
Lakshmi Narayanan Sreethar	10fffcd646	sstables_manager: maybe_reclaim_components: yield between iterations Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	77107ddaa3	sstables_manager: rename `increment_total_reclaimable_memory_and_maybe_reclaim()` Renamed the above-mentioned method to `increment_total_reclaimable_memory()` as it doesn't directly reclaim memory anymore. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	7f0f839d6d	sstables_manager: move reclaim logic into `components_reclaim_reload_fiber()` Move the sstable reclaim logic into `components_reclaim_reload_fiber()` in preparation for the fix for #21947. This also simplifies the overall reclaim/reload logic by preventing multiple fibers from attempting to reclaim/reload component memory concurrently. Also, update the existing test cases to adapt to this change. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	f73b6abcc7	sstables_manager: rename `_sstable_deleted_event` condition variable Rename the `_sstable_deleted_event` condition variable to `_components_memory_change_event` as it will be used by future patches to signal changes in sstable component memory consumption, (i.e.) during sstable create and delete, and also when the `components_memory_reclaim_threshold` config value is changed. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	35a4de3eeb	sstables_manager: rename `components_reloader_fiber()` A future patch will move components reclaim logic into the current `components_reloader_fiber()`, so to reflect its new purpose, rename it to `components_reclaim_reload_fiber()`. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	4d396b9578	sstables_manager: fix `maybe_reclaim_components()` indentation Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	f53fd40ff0	sstables_manager: reclaim components memory until usage falls below threshold The current implementation reclaims memory from SSTables only when a new SSTable is created. An upcoming patch will move this reclaim logic into the existing component reloader fiber. To support this change, the `maybe_reclaim_components()` method is updated to reclaim memory until the total memory consumption falls below the configured threshold. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	30184ead79	sstables_manager: introduce `get_components_memory_reclaim_threshold()` Introduce `get_components_memory_reclaim_threshold()`, which returns the components' memory threshold based on the total available memory. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	4d12ae433a	sstables_manager: extract `maybe_reclaim_components()` Extract the code from `increment_total_reclaimable_memory_and_maybe_reclaim()` that reclaims the components memory into `maybe_reclaim_components()`. The extracted new method will be used by a following patch to handle reclaim within the components reload fiber. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:04 +05:30
Lakshmi Narayanan Sreethar	59cbee6fc7	sstables_manager: fix `maybe_reload_components()` indentation Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:03 +05:30
Lakshmi Narayanan Sreethar	ce2aa15d19	sstables_manager: extract out `maybe_reload_components()` Extract the logic that reloads reclaimed components into memory in the `components_reloader_fiber()` method into a separate method. This is in preparation for moving the reclaim logic into the same fiber. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-02-14 22:11:03 +05:30
Pavel Emelyanov	8f61d26007	test/perf/s3: Add --part-size-mb option for upload test Test now uses default internal part size, but for performance comparisons it's good to make it configurable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 16:27:26 +03:00
Pavel Emelyanov	6211b39f4b	test/perf/s3: Add uploading test The test picks up a file and uploads it into the bucket, then prints the time it took and uploading speed. For now it's enough, with existing S3 latencies more timing details can be obtained by turning on trace logging on s3 logger. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 16:27:26 +03:00
Pavel Emelyanov	0919a70ac8	test/perf/s3: Some renames not to be download-centric Now this test is all about reading objects. Rename some bits in it so that they can be re-used by future uploading test as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 16:27:26 +03:00
Pavel Emelyanov	24c194dcf3	test/perf/s3: Make object/file name configurable Now the download test first creates a temporary object and then reads data from it. It's good to have an option to download a pre-existing file. This option will also be used for the uploading test (next patches). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 16:27:25 +03:00
Pavel Emelyanov	6b27642a79	test/perf/s3: Configure maximum number of sockets Add the --sockets NR option that limits the number of sockets the underlying http client is configured to have. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 16:27:25 +03:00
Pavel Emelyanov	230d4d7c5e	test/perf/s3: Remove parallelism The test spawns several fibers that read the same file in parallel. There's not much point in it, it just makes the code harder to maintain. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 16:27:25 +03:00
Pavel Emelyanov	b52d1a3d99	s3/client: Make http client connections limit configurable It's now calculated based on sched group shares, but for tests explicit value is needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 16:27:25 +03:00
Aleksandra Martyniuk	e499f7c971	test: add test to check dcs and hosts repair filter	2025-02-14 13:46:44 +01:00
Ferenc Szili	d598750b2d	storage_proxy: allow multiple truncate table fibers per shard In order to allow concurrent truncate table operations (for the time being, only for a single table) we have to remove the limitation allowing only one truncate table fiber per shard. This change adds the ability to collect the active truncate fibers in storage_proxy::remote into std::list<> instead of having just a single truncate fiber. These fibers are waited for completion during storage_proxy::remote::stop().	2025-02-14 12:35:31 +01:00
Abhinav Jha	e491950c47	raft topology: Add support for raft topology system tables initialization to happen before group0 initialization In the current scenario, the topology_change_kind variable was being handled using the _manage_topology_change_kind_from_group0 variable. This method was brittle and had some bugs (e.g. in the restart case, it led to a time gap between group0 server start and topology_change_kind being managed via group0). After the removal of _manage_topology_change_kind_from_group0, careful management of the topology_change_kind variable was needed to maintain the correct topology_change_kind in all scenarios. So this PR also performs a refactoring to populate all init data to system tables even before group0 creation (via the raft_initialize_discovery_leader function). Because raft_initialize_discovery_leader now happens before group 0 creation, we write mutations directly to system tables instead of using a group 0 command. Hence, after group0 creation, the node can read the correct values from system tables and correct values are maintained throughout. Added a new function, initialize_done_topology_upgrade_state, which takes care of writing the correct upgrade state to system tables before starting the group0 server. This ensures that the node can read the correct values from system tables and correct values are maintained throughout. By moving the raft_initialize_discovery_leader logic to happen before starting the group0 server, and not as a group0 command after server start, we also get rid of the potential problem of the init group0 command not being the first command on the server, ensuring the integrity the programmer expects. Fixes: scylladb/scylladb#21114	2025-02-14 16:56:17 +05:30
Aleksandra Martyniuk	1c8a41e2dd	test: add repair dc selection to test_tablet_metadata_persistence	2025-02-14 09:13:11 +01:00
Asias He	5545289bfa	repair: Introduce Host and DC filter support Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. #21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support. Fixes #22417	2025-02-14 09:13:11 +01:00
Aleksandra Martyniuk	4c75701756	docs: locator: update the docs and formatter of tablet_task_info	2025-02-14 09:13:11 +01:00
Kefu Chai	b448fea260	sstable_loader: fix cross-shard resource cleanup in download_task_impl Previously, download_task_impl's destructor would destroy per-shard progress elements on whatever shard the task was destroyed on. In multi-shard environments, this caused "shared_ptr accessed on non-owner cpu" errors when attempting to free memory allocated on a different shard. Fix by: - Convert progress_per_shard into a sharded service - Stop the service on owner shards during cleanup using coroutines - Add operator+= to stream_progress to leverage seastar's built-in adder instead of a custom adder struct Alternative approaches considered: 1. Using foreign_ptr: Rejected as it would require interface changes that complicate stream delegation. foreign_ptr manages the underlying pointee with another smart pointer but does not expose the smart pointer instance in its APIs, making it impossible to use shared_ptr<stream_progress> in the interface. 2. Using vector<stream_progress>: Rejected for similar interface compatibility reasons. This solution maintains the existing interfaces while ensuring proper cross-shard cleanup. Fixes scylladb/scylladb#22759 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-14 11:13:58 +08:00
Kefu Chai	4c1f1baab4	tasks: make release_resources() a coroutine Convert tasks::task_manager::task::impl::release_resources() to a coroutine to prepare for upcoming changes that will implement asynchronous resource release. This is a preparatory refactoring that enables future coroutine-based implementation of resource cleanup logic. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-14 11:13:58 +08:00
Michał Chojnowski	294b839e34	test_rpc_compression.py: fix an overly-short timeout The timeout of 10 seconds is too small for CI. I didn't mean to make it so short, it was an accident. Fix that by changing the timeout to 10 minutes. Fixes scylladb/scylladb#22832 Closes scylladb/scylladb#22836	2025-02-13 17:49:39 +01:00
Gleb Natapov	d288d79d78	api: initialize token metadata API after starting the gossiper Token metadata API now depends on gossiper to do ip to host id mappings, so initialize it after the gossiper is initialized and de-initialize it before gossiper is stopped. Fixes: scylladb/scylladb#22743 Closes scylladb/scylladb#22760	2025-02-13 14:39:05 +01:00
Takuya ASADA	b5e306047f	dist: fix upgrade error from 2024.1 We need to allow replacing nodetool from scylla-enterprise-tools < 2024.2, just like we did for scylla-tools < 5.5. This is required to make packages able to upgrade from 2024.1. Fixes #22820 Closes scylladb/scylladb#22821	2025-02-13 12:36:24 +02:00
Botond Dénes	c57492bd73	Update tools/java submodule * tools/java 807e991d...4f1353ba (1): > dist: support smooth upgrade from enterprise to source available Refs scylladb/scylladb#22820	2025-02-13 12:32:07 +02:00
Petr Gusev	12cc84f8a9	raft_rpc::send_append_entries: limit memory usage Serializing raft::append_request for transmission requires approximately the same amount of memory as its size. This means when the Raft library replicates a log item to M servers, the log item is effectively copied M times. To prevent excessive memory usage and potential out-of-memory issues, we limit the total memory consumption of in-flight raft::append_request messages. Fixes [scylladb/scylladb#14411](https://github.com/scylladb/scylladb/issues/14411)	2025-02-13 10:29:09 +01:00
Nadav Har'El	e6dcb605cb	Merge 'Fix typos' from Dmitriy Rokhfeld (TripleChecker) Hey, our tool caught a few typos in your repository. Also, here is your site's error report: https://triplechecker.com/s/Dza11H/scylladb.com Hope it's helpful! Closes scylladb/scylladb#22787 * github.com:scylladb/scylladb: Fix typos Fix typos	2025-02-13 11:14:29 +02:00
TripleChecker	8d64be94e2	Fix typos	2025-02-13 01:54:08 +02:00
Wojciech Mitros	86838a147d	test: skip test_complex_null_values in uf_typest_test test_complex_null_values is currently flaky, causing many failures in CI. The reason for the failures is unclear, and a fix might not be simple, so because UDFs are experimental, for now let's skip this test until the corresponding issue is fixed. Refs scylladb/scylladb#22799 Closes scylladb/scylladb#22818	2025-02-12 21:37:34 +01:00
Andrei Chekun	54c165c94c	test: Skip test_raft_voters because of existing issue https://github.com/scylladb/scylladb/issues/18793 Closes scylladb/scylladb#22710	2025-02-12 16:41:17 +03:00
Petr Gusev	043291a2b4	fms: extract entry_size to log_entry::get_size We intend to reuse it in subsequent commit.	2025-02-12 14:33:41 +01:00
Anna Stuchlik	b860b2109f	doc: add a warning for admins launching ScyllaDB on Azure Fixes scylladb/scylladb#22686 Refs scylladb/scylladb#22505 Closes scylladb/scylladb#22687	2025-02-12 14:27:19 +01:00
Tomasz Grabiec	d8ea780244	Merge 'scylla-gdb.py: introduce scylla tablet-metadata command' from Botond Dénes Dumps the content of the tablet metadata. Very useful for debugging tablet related problems. Example output: ``` (gdb) scylla tablet-metadata --table usertable_no_lwt This node: host_id: b90662a9-98b1-4452-bc45-44d460ecab62, shard: 0 table alternator_usertable_no_lwt.usertable_no_lwt: id: 68316fa0-78ec-11ef-af10-98d4ab71aac4, tablets: 32, resize decision: merge#1, transitions: 0 tablet#0: last token: -8646911284551352321, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#0, 84d0cb45-1c6c-4870-b727-03db3130641f#0, b933959e-8134-4ba0-8c44-33dbd51170e9#0] tablet#1: last token: -8070450532247928833, replicas: [fb0167dc-7a7d-476d-b4a5-4a55a52dadff#0, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#0, ac2fdd20-2f54-4960-9856-27fd07ed38ef#0] tablet#2: last token: -7493989779944505345, replicas: [fb0167dc-7a7d-476d-b4a5-4a55a52dadff#1, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#1, b5ddcd7e-45ed-4f20-8841-353bd82cc04c#1] tablet#3: last token: -6917529027641081857, replicas: [ac2fdd20-2f54-4960-9856-27fd07ed38ef#1, b933959e-8134-4ba0-8c44-33dbd51170e9#1, 84d0cb45-1c6c-4870-b727-03db3130641f#1] tablet#4: last token: -6341068275337658369, replicas: [ac2fdd20-2f54-4960-9856-27fd07ed38ef#2, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#2, b5ddcd7e-45ed-4f20-8841-353bd82cc04c#2] tablet#5: last token: -5764607523034234881, replicas: [4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#2, b933959e-8134-4ba0-8c44-33dbd51170e9#2, 84d0cb45-1c6c-4870-b727-03db3130641f#2] tablet#6: last token: -5188146770730811393, replicas: [84d0cb45-1c6c-4870-b727-03db3130641f#3, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#3, ac2fdd20-2f54-4960-9856-27fd07ed38ef#3] tablet#7: last token: -4611686018427387905, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#3, b933959e-8134-4ba0-8c44-33dbd51170e9#3, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#3] tablet#8: last token: -4035225266123964417, replicas: [b933959e-8134-4ba0-8c44-33dbd51170e9#4, 
b5ddcd7e-45ed-4f20-8841-353bd82cc04c#4, ac2fdd20-2f54-4960-9856-27fd07ed38ef#4] tablet#9: last token: -3458764513820540929, replicas: [4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#4, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#4, 84d0cb45-1c6c-4870-b727-03db3130641f#4] tablet#10: last token: -2882303761517117441, replicas: [84d0cb45-1c6c-4870-b727-03db3130641f#5, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#5, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#5] tablet#11: last token: -2305843009213693953, replicas: [ac2fdd20-2f54-4960-9856-27fd07ed38ef#5, b5ddcd7e-45ed-4f20-8841-353bd82cc04c#5, b933959e-8134-4ba0-8c44-33dbd51170e9#5] tablet#12: last token: -1729382256910270465, replicas: [b933959e-8134-4ba0-8c44-33dbd51170e9#6, b5ddcd7e-45ed-4f20-8841-353bd82cc04c#6, 84d0cb45-1c6c-4870-b727-03db3130641f#6] tablet#13: last token: -1152921504606846977, replicas: [4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#6, ac2fdd20-2f54-4960-9856-27fd07ed38ef#6, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#6] tablet#14: last token: -576460752303423489, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#7, 84d0cb45-1c6c-4870-b727-03db3130641f#7, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#7] tablet#15: last token: -1, replicas: [b933959e-8134-4ba0-8c44-33dbd51170e9#7, ac2fdd20-2f54-4960-9856-27fd07ed38ef#7, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#7] tablet#16: last token: 576460752303423487, replicas: [ac2fdd20-2f54-4960-9856-27fd07ed38ef#8, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#8, 84d0cb45-1c6c-4870-b727-03db3130641f#8] tablet#17: last token: 1152921504606846975, replicas: [b933959e-8134-4ba0-8c44-33dbd51170e9#8, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#8, b5ddcd7e-45ed-4f20-8841-353bd82cc04c#8] tablet#18: last token: 1729382256910270463, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#9, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#9, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#9] tablet#19: last token: 2305843009213693951, replicas: [84d0cb45-1c6c-4870-b727-03db3130641f#9, ac2fdd20-2f54-4960-9856-27fd07ed38ef#9, b933959e-8134-4ba0-8c44-33dbd51170e9#9] 
tablet#20: last token: 2882303761517117439, replicas: [ac2fdd20-2f54-4960-9856-27fd07ed38ef#10, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#10, b933959e-8134-4ba0-8c44-33dbd51170e9#10] tablet#21: last token: 3458764513820540927, replicas: [84d0cb45-1c6c-4870-b727-03db3130641f#10, b5ddcd7e-45ed-4f20-8841-353bd82cc04c#10, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#10] tablet#22: last token: 4035225266123964415, replicas: [4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#11, 84d0cb45-1c6c-4870-b727-03db3130641f#11, b933959e-8134-4ba0-8c44-33dbd51170e9#11] tablet#23: last token: 4611686018427387903, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#11, ac2fdd20-2f54-4960-9856-27fd07ed38ef#11, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#11] tablet#24: last token: 5188146770730811391, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#12, 84d0cb45-1c6c-4870-b727-03db3130641f#12, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#12] tablet#25: last token: 5764607523034234879, replicas: [ac2fdd20-2f54-4960-9856-27fd07ed38ef#12, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#12, b933959e-8134-4ba0-8c44-33dbd51170e9#12] tablet#26: last token: 6341068275337658367, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#13, b933959e-8134-4ba0-8c44-33dbd51170e9#13, 84d0cb45-1c6c-4870-b727-03db3130641f#13] tablet#27: last token: 6917529027641081855, replicas: [ac2fdd20-2f54-4960-9856-27fd07ed38ef#13, fb0167dc-7a7d-476d-b4a5-4a55a52dadff#13, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#13] tablet#28: last token: 7493989779944505343, replicas: [b5ddcd7e-45ed-4f20-8841-353bd82cc04c#0, b933959e-8134-4ba0-8c44-33dbd51170e9#0, ac2fdd20-2f54-4960-9856-27fd07ed38ef#0] tablet#29: last token: 8070450532247928831, replicas: [fb0167dc-7a7d-476d-b4a5-4a55a52dadff#0, 84d0cb45-1c6c-4870-b727-03db3130641f#0, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#0] tablet#30: last token: 8646911284551352319, replicas: [fb0167dc-7a7d-476d-b4a5-4a55a52dadff#1, ac2fdd20-2f54-4960-9856-27fd07ed38ef#1, b5ddcd7e-45ed-4f20-8841-353bd82cc04c#1] tablet#31: last token: 9223372036854775807, 
replicas: [b933959e-8134-4ba0-8c44-33dbd51170e9#1, 4b1e8a42-e8b3-432e-bf7c-b0f7a10eb3cd#1, 84d0cb45-1c6c-4870-b727-03db3130641f#1] ``` The PR includes two marginally related small fixes too. Improvement, no backport needed. Closes scylladb/scylladb#20940 * github.com:scylladb/scylladb: scylla-gdb.py: add scylla tablet-metadata command scylla-gdb.py: register the scylla table command scylla-gdb.py: unordered_map: improve flat_hash_map matching	2025-02-12 13:27:36 +01:00
Andrei Chekun	9540e056a4	test: Add the possibility to run raft tests with pytest Closes scylladb/scylladb#22775	2025-02-12 14:10:19 +02:00
Artsiom Mishuta	b36d586d80	test.py: move mv tests into a separate folder Now that we support suite subfolders, this commit moves the mv tests into a separate folder as an example. Custom test.py lookup also works. Tests can be run as: 1. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets/test_mv_tablets_empty_ip 2. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv/tablets 3. ./tools/toolchain/dbuild ./test.py --no-gather-metrics --mode=dev topology_custom/mv	2025-02-12 12:27:26 +01:00
Artsiom Mishuta	5ca025a8c1	test.py: support subfolders Creating a separate folder used to be needed for two reasons: - we want a separate test suite, with its own settings - we want to structure tests, e.g. tablets, raft, schema, gossip. We've been creating many folders recently. However, test suite infrastructure is expensive in test.py - each suite has its own pool of servers, concurrency settings and so on. Make it possible to structure tests without too many suites, by supporting subfolders within a suite. Fixes #20570	2025-02-12 11:46:06 +01:00
Botond Dénes	7150442f6a	service/storage_proxy: schedule_repair(): materialize the range into a vector Said method passes down its `diff` input to `mutate_internal()`, after some std::ranges massaging. Said massaging is destructive -- it moves items from the diff. If the output range is iterated-over multiple times, only the first time will see the actual output, further iterations will get an empty range. When trace-level logging is enabled, this is exactly what happens: `mutate_internal()` iterates over the range multiple times, first to log its content, then to pass it down the stack. This ends up resulting in a range with moved-from elements being passed down and consequently write handlers being created with nullopt mutations. Make the range re-entrant by materializing it into a vector before passing it to `mutate_internal()`. Fixes: scylladb/scylladb#21907 Fixes: scylladb/scylladb#21714 Closes scylladb/scylladb#21910	2025-02-12 12:38:47 +02:00
Kefu Chai	6e1fb2c74e	build: limit ThinLTO link parallelism to prevent OOM in release builds When building Scylla with ThinLTO enabled (default with Clang), the linker spawns threads equal to the number of CPU cores during linking. This high parallelism can cause out-of-memory (OOM) issues in CI environments, potentially freezing the build host or triggering the OOM killer. In this change: 1. Rename `LINK_MEM_PER_JOB` to `Scylla_RAM_PER_LINK_JOB` and make it user-configurable 2. Add `Scylla_PARALLEL_LINK_JOBS` option to directly control concurrent link jobs (useful for hosts with large RAM) 3. Increase the default value of `Scylla_RAM_PER_LINK_JOB` to 16 GiB when LTO is enabled 4. Default to 2 parallel link jobs when LTO is enabled if the calculated number is less than 2, for a faster build. Notes: - Host memory is shared across job pools, so pool separation alone doesn't help - Ninja lacks per-job memory quota support - Only affects link parallelism in LTO-enabled builds See https://clang.llvm.org/docs/ThinLTO.html#controlling-backend-parallelism Fixes scylladb/scylladb#22275 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22383	2025-02-12 10:24:13 +02:00
Alexander Turetskiy	3ac533251a	allow "UTC" and "GMT" in string format of timestamp fix problem with statements like: INSERT INTO tbl (pk, time) VALUES (1, '2016-09-27 16:10:00 UTC'); fixes #20501 Closes scylladb/scylladb#22426	2025-02-12 09:38:28 +02:00
Alexander Turetskiy	47011ab830	Materialized view name length should be limited Oversized materialized view and index names are rejected; Materialized view names with invalid symbols are rejected. fixes: #20755 Closes scylladb/scylladb#21746	2025-02-11 22:16:09 +02:00
Avi Kivity	5c647408c7	systemd: map libraries close to the executable The Intel Optimization Manual states that branches with relative offsets greater than 2GB suffer a penalty. They cite a 6% improvement when this is avoided. Our code doesn't rely heavily on dynamically linked libraries, so I don't expect a similar win, but it's still better to do it than not. Eliminate long branches by asking the dynamic linker to restrict itself to the lower 4GB of the address space. I saw that it maps libraries at 1GB+ addresses, so this satisfies the limitation. Fix is from the Intel Optimization Manual as well. This change was ported from ScyllaDB Enterprise. Closes scylladb/scylladb#22498	2025-02-11 22:16:09 +02:00
Avi Kivity	de3b2c827f	service: topology coordinator: demote log message about refreshing stats This repeats every minute and isn't very interesting. Demote to debug to reduce log clutter. Closes scylladb/scylladb#22784	2025-02-11 22:16:09 +02:00
Botond Dénes	f808f84a45	db/config: improve description of repair_multishard_reader_enable_read_ahead The current description has a typo and is in general not informative enough on when this option should be used. Closes scylladb/scylladb#21758	2025-02-11 22:16:09 +02:00
Botond Dénes	be5c28e149	scylla-gdb.py: add scylla tablet-metadata command Dumps the content of the tablet-metadata. Very useful for debugging tablet-related problems.	2025-02-11 07:29:46 -05:00
Botond Dénes	23db82b957	scylla-gdb.py: register the scylla table command This command exists but is not registered. There is a test for it, but it happens to work only because scylla table is a prefix of scylla tables (another command), so gdb invokes that other command instead.	2025-02-11 07:29:46 -05:00
Botond Dénes	3ec8ef90fe	scylla-gdb.py: unordered_map: improve flat_hash_map matching Strip typedefs from the type before matching.	2025-02-11 07:29:30 -05:00
Avi Kivity	5adaf0a605	Merge 'tree: migrate from boost::remove_if() to the standard library based alternatives' from Kefu Chai Replace boost::remove_if() with the standard library's std::erase_if() or std::ranges::remove_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#22788 * github.com:scylladb/scylladb: service: migrate from boost::range::remove_if() to std::ranges::remove_if sstable: migrate from boost::remove_if() to std::erase_if()	2025-02-11 14:07:48 +02:00
Kefu Chai	481397317d	sstables, test: migrate from boost::copy() to std::ranges::copy() Replace boost::copy() with the standard library's std::ranges::copy() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22789	2025-02-11 14:55:25 +03:00
Asias He	fb318d0c81	repair: Add await_completion option for tablet_repair api Set true to wait for the repair to complete. Set false to skip waiting for the repair to complete. When the option is not provided, it defaults to false. It is useful for management tools that want the api to be async. Fixes #22418 Closes scylladb/scylladb#22436	2025-02-11 12:49:12 +02:00
Avi Kivity	770dc37f0f	tools: toolchain: prepare: fix optimized_clang archive printout prepare helpfully prints out the path where optimized clang is stored, but a couple of typos mean it prints out an empty string. Fix that. Closes scylladb/scylladb#22714	2025-02-11 11:50:01 +02:00
Nadav Har'El	1842d456a1	test/cqlpy: fix some false failures on Cassandra Developers are expected to run new cqlpy tests against Cassandra - to verify that the new test itself is correct. Usually there is no need to run the entire cqlpy test suite against Cassandra, but when users do this, it isn't confidence-inspiring to see hundreds of tests failing. In this patch I fix many but not all of these failures. Refs #11690 (which will remain open until we fix all the failures on Cassandra) * Fixed the "compact_storage" fixture recently introduced to enable the deprecated feature in Scylla for the tests. This fixture was broken on Cassandra and caused all compact-storage related tests to fail on Cassandra. * Marked all tests in test_tombstone_limit.py as scylla_only - as they check the Scylla-only query_tombstone_page_limit configuration option. * Marked all tests in test_service_level_api.py as scylla_only - as they check the Scylla-only service levels feature. * Marked a test specific to the Scylla-only IncrementalCompactionStrategy as scylla_only. Some tests mix STCS and ICS testing in one test - this is a mistake and isn't fixed in this patch. * Various tests in test_tablets.py forgot to use skip_without_tablets to skip them on Cassandra or older Scylla that doesn't have the tablets feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22561	2025-02-11 11:48:40 +02:00
Botond Dénes	4a7a75dfcb	Merge 'tasks: use host_id in task manager' from Aleksandra Martyniuk Use host_id in a children list of a task in task manager to indicate a node on which the child was created. Move TASKS_CHILDREN_REQUEST to IDL. Send it by host_id. Fixes: https://github.com/scylladb/scylladb/issues/22284. Ip to host_id transition; backport isn't needed. Closes scylladb/scylladb#22487 * github.com:scylladb/scylladb: tasks: drop task_manager::config::broadcast_address as it's unused tasks: replace ip with host_id in task_identity api: task_manager: pass gossiper to api::set_task_manager tasks: keep host_id in task_manager tasks: move tasks_get_children to IDL	2025-02-11 11:32:27 +02:00
Patryk Jędrzejczak	7b8344faa8	Merge 'Fix a regression that sometimes causes an internal error and demote barrier_and_drain rpc error log to a warning ' from Gleb Natapov The series fixes a regression and demotes a barrier_and_drain logging error to a warning since this particular condition may happen during normal operation. We want to backport both since one is a bug fix and another is trivial and reduces CI flakiness. Closes scylladb/scylladb#22650 * https://github.com/scylladb/scylladb: topology_coordinator: demote barrier_and_drain rpc failure to warning topology_coordinator: read peers table only once during topology state application	2025-02-11 10:25:35 +01:00
Pavel Emelyanov	529ff3efa5	Merge 'Alternator: implement UpdateTable operation to add or delete GSI' from Nadav Har'El In this series we implement the UpdateTable operation to add a GSI to an existing table, or remove a GSI from a table. As the individual commit messages will explain, this required changing how Alternator stores materialized view keys - instead of insisting that these keys must be real columns (which is not the case when adding a GSI to an existing table), the materialized view can now take as its key any Alternator attribute serialized inside the ":attrs" map holding all non-key attributes. Fixes #11567. We also fix the IndexStatus and Backfilling attributes returned by DescribeTable - as DynamoDB API users use this API to discover when a newly added GSI completed its "backfilling" (what we call "view building") stage. Fixes #11471. This series should not be backported lightly - it's a new feature and required fairly large and intrusive changes that can introduce bugs to use cases that don't even use Alternator or its UpdateTable operations - every user of CQL materialized views or secondary indexes, as well as Alternator GSI or LSI, will use modified code. It should be backported to 2025.1, though - this version was actually branched long after this PR was sent, and it provides a feature that was promised for 2025.1.
Closes scylladb/scylladb#21989 * github.com:scylladb/scylladb: alternator: fix view build on oversized GSI key attribute mv: clean up do_delete_old_entry test/alternator: unflake test for IndexStatus test/alternator: work around unrelated bug causing test flakiness docs/alternator: adding a GSI is no longer an unimplemented feature test/alternator: remove xfail from all tests for issue 11567 alternator: overhaul implementation of GSIs and support UpdateTable mv: support regular_column_transformation key columns in view alternator: add new materialized-view computed column for item in map build: in cmake build, schema needs alternator build: build tests with Alternator alternator: add function serialized_value_if_type() mv: introduce regular_column_transformation, a new type of computed column alternator: add IndexStatus/Backfilling in DescribeTable alternator: add "LimitExceededException" error type docs/alternator: document two more unimplemented Alternator features	2025-02-11 10:02:01 +03:00
Kefu Chai	a18069fad7	service: migrate from boost::range::remove_if() to std::ranges::remove_if Replace boost::range::remove_if() with the standard library's std::ranges::remove_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-11 09:15:14 +08:00
Kefu Chai	ba724a26f4	sstable: migrate from boost::remove_if() to std::erase_if() Replace boost::remove_if() with the standard library's std::erase_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-11 09:15:14 +08:00
TripleChecker	e72e6fadeb	Fix typos	2025-02-11 00:17:43 +02:00
Avi Kivity	6a1ee32cc3	Merge 'raft/group0_state_machine: load current RPC compression dict on startup' from Michał Chojnowski We are supposed to be loading the most recent RPC compression dictionary on startup, but we forgot to port the relevant piece of logic during the source-available port. This causes a restarted node not to use the dictionary for RPC compression until the next dictionary update. Fix that. Fixes scylladb/scylladb#22738 This is more of a bugfix than an improvement, so it should be backported to 2025.1. Closes scylladb/scylladb#22739 * github.com:scylladb/scylladb: test_rpc_compression.py: test the dictionaries are loaded on startup raft/group0_state_machine: load current RPC compression dict on startup	2025-02-10 20:40:33 +02:00
Dawid Mędrek	cd50152522	service/mapreduce_service: Cancel query when stopping Before these changes, shutting down a node could be prolonged because of mapreduce_service. `mapreduce_service::stop()` uninitializes messaging service, which includes waiting for all ongoing RPC handlers. We already had a mechanism for cancelling local mapreduce tasks, but we were missing one for cancelling external queries. In this commit, we modify the signature of the request so it supports cancelling via an abort source. We also provide a reproducer test for the problem. Fixes scylladb/scylladb#22337 Closes scylladb/scylladb#22651	2025-02-10 20:12:59 +02:00
Asias He	6f04de3efd	streaming: Fail stream plan on stream_mutation_fragments handler in case of error The following is observed in pytest: 1) node1, stream master, tried to pull data from node3 2) node3, stream follower, found node1 restarted 3) node3 killed the rpc stream 4) node1 did not get the stream session failure message from node3. This failure message was supposed to kill the stream plan on node1. That's the reason node1 failed the stream session much later at "2024-08-19 21:07:45,539". Note, node3 failed the stream on its side, so it should have sent the stream session failure message. ``` $ cat node1.log \|grep f890bea0-5e68-11ef-99ae-e5bca04385fc INFO 2024-08-19 20:24:01,162 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Executing streaming plan for Tablet migration-ks-index-0 with peers={127.0.34.3}, master ERROR 2024-08-19 20:24:01,190 [shard 1:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=127.0.34.3: seastar::nested_exception: seastar::rpc::stream_closed (rpc stream was closed by peer) (while cleaning up after seastar::rpc::stream_closed (rpc stream was closed by peer)) WARN 2024-08-19 21:07:45,539 [shard 0:main] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming plan for Tablet migration-ks-index-0 failed, peers={127.0.34.3}, tx=0 KiB, 0.00 KiB/s, rx=484 KiB, 0.18 KiB/s $ cat node3.log \|grep f890bea0-5e68-11ef-99ae-e5bca04385fc INFO 2024-08-19 20:24:01,163 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Executing streaming plan for Tablet migration-ks-index-0 with peers=127.0.34.1, slave INFO 2024-08-19 20:24:01,164 [shard 1:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Start sending ks=ks, cf=cf, estimated_partitions=2560, with new rpc streaming WARN 2024-08-19 20:24:01,187 [shard 0: gms] stream_session - [Stream 
#f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming plan for Tablet migration-ks-index-0 failed, peers={127.0.34.1}, tx=633 KiB, 26506.81 KiB/s, rx=0 KiB, 0.00 KiB/s WARN 2024-08-19 20:24:01,188 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] stream_transfer_task: Fail to send to 127.0.34.1:0: seastar::rpc::stream_closed (rpc stream was closed by peer) WARN 2024-08-19 20:24:01,189 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Failed to send: seastar::rpc::stream_closed (rpc stream was closed by peer) WARN 2024-08-19 20:24:01,189 [shard 0:strm] stream_session - [Stream #f890bea0-5e68-11ef-99ae-e5bca04385fc] Streaming error occurred, peer=127.0.34.1 ``` To be safe in case the stream fail message is not received, node1 could fail the stream plan as soon as the rpc stream is aborted in the stream_mutation_fragments handler. Fixes #20227 Closes scylladb/scylladb#21960	2025-02-10 16:32:12 +01:00
Avi Kivity	cf72c31617	treewide: improve bash error reporting bash error handling and reporting is atrocious. Without -e it will just ignore errors. With -e it will stop on errors, but not report where the error happened (apart from exiting itself with an error code). Improve that with the `trap ERR` command. Note that this won't be invoked on intentional error exit with `exit 1`. We apply this on every bash script that contains -e or that it appears trivial to set it in. Non-trivial scripts without -e are left unmodified, since they might intentionally invoke failing scripts. Closes scylladb/scylladb#22747	2025-02-10 18:28:52 +03:00
Pavel Emelyanov	81f7a6d97d	doc: Update system.sstables table schema description The partition key had been renamed and its type changed some time ago, but the doc wasn't updated. Fix it. refs: #20998 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#22683	2025-02-10 16:09:49 +02:00
Botond Dénes	51a273401c	Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec This PR converts boost load balancer tests in preparation for load balancer changes which add per-table tablet hints. After those changes, load balancer consults with the replication strategy in the database, so we need to create proper schema in the database. To do that, we need proper topology for replication strategies which use RF > 1, otherwise keyspace creation will fail. Topology is created in tests via group0 commands, which is abstracted by the new `topology_builder` class. Tests cannot modify token_metadata only in memory now as it needs to be consistent with the schema and on-disk metadata. That's why modifications to tablet metadata are now made under group0 guard and save back metadata to disk. Closes scylladb/scylladb#22648 * github.com:scylladb/scylladb: test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario tests: tablets: Set initial tablets to 1 to exit growing mode test: tablets_test: Create proper schema in load balancer tests test: lib: Introduce topology_builder test: cql_test_env: Expose topology_state_machine topology_state_machine: Introduce lock transition	2025-02-10 16:08:41 +02:00
Avi Kivity	c212f5a296	db/config: forward-declare boost options_description_easy_init Reduces large dependency pull from boost. Closes scylladb/scylladb#22748	2025-02-10 15:08:11 +02:00
Nikita Kurashkin	025bb379a4	cql: remove expansion of "SELECT *" in DESC MATERIALIZED VIEW This patch removes expansion of "SELECT *" in DESC MATERIALIZED VIEW. Instead of explicitly printing each column, DESC command will now just use SELECT *, if view was created with it. Also, adds a corresponding test. Fixes #21154 Closes scylladb/scylladb#21962	2025-02-10 15:01:23 +02:00
Kefu Chai	c6bf9d8d11	sstables: switch from boost to std::ranges::all_of() Replace boost::algorithm::all_of_equal() with std::ranges::all_of() In order to reduce the header dependency to boost ranges library, let's use the utility from the standard library when appropriate. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22730	2025-02-10 15:44:55 +03:00
Kefu Chai	09a090e410	ent/encryption: Replace manual string suffix checks with ends_with() Replace manual string suffix comparison (length check + std::equal) with std::string::ends_with() introduced in C++20 for better readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22764	2025-02-10 15:42:39 +03:00
Avi Kivity	d4c531307d	replica: database.hh: drop dependency on boost ranges Reduces dependency load. Closes scylladb/scylladb#22749	2025-02-10 13:29:55 +01:00
Michael Litvak	c098e9a327	test/test_view_build_status: fix flaky asserts In a few test cases of test_view_build_status we create a view, wait for it and then query the view_build_status table and expect it to have all rows for each node and view. But it may fail because it could happen that the wait_for_view query and the following queries are done on different nodes, and some of the nodes didn't apply all the table updates yet, so they have missing rows. To fix it, we change the assert to work in the eventual consistency sense, retrying until the number of rows is as expected. Fixes scylladb/scylladb#22644 Closes scylladb/scylladb#22654	2025-02-10 12:41:42 +01:00
Kefu Chai	ca832dc4fb	.github: Make "make-pr-ready-for-review" workflow run in base repo The "make-pr-ready-for-review" workflow was failing with an "Input required and not supplied: token" error. This was due to GitHub Actions security restrictions preventing access to the token when the workflow is triggered in a fork: ``` Error: Input required and not supplied: token ``` This commit addresses the issue by: - Running the workflow in the base repository instead of the fork. This grants the workflow access to the required token with write permissions. - Simplifying the workflow by using a job-level `if` condition to control execution, as recommended in the GitHub Actions documentation (https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution). This is cleaner than conditional steps. - Removing the repository checkout step, as the source code is not required for this workflow. This change resolves the token error and ensures the "make-pr-ready-for-review" workflow functions correctly. Fixes scylladb/scylladb#22765 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22766	2025-02-10 12:56:39 +02:00
Abhi	4748125a48	service/raft: Refactor mutation writing helper functions. We use these changes in the following commit.	2025-02-10 14:48:25 +05:30
Evgeniy Naydanov	06793978c1	test.py: new Python dependencies for dtest->test.py migration 3rd-party library which provides compatibility between sync and async code: universalasync A few deps from scylla-dtest: deepdiff cryptography boto3-stubs[dynamodb] [avi: regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz ] Closes scylladb/scylladb#22497	2025-02-10 10:52:27 +02:00
Kefu Chai	0185aa458b	build: cmake: remove trailing comma in db/CMakeLists.txt source list In `c5668d99`, a new source file row_cache.cc was added to the `db` target, but with an extraneous trailing comma. In CMake's target_sources(), source files should be space-separated - any comma is interpreted as part of the filename, causing build failures like: ``` CMake Error at db/CMakeLists.txt:2 (target_sources): Cannot find source file: row_cache.cc, ``` Fix the issue by removing the trailing comma. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22754	2025-02-09 17:28:47 +02:00
Nadav Har'El	a492e239e3	Merge 'test.py: Add the possibility to run boost and unit tests with pytest ' from Andrei Chekun Add the possibility to run boost and unit tests with pytest test.py should follow the next paradigm - the ability to run all test cases sequentially by ONE pytest command. With this paradigm, to have better performance, we can split this 1 command into 2,3,4,5,100,200... whatever we want It's a new functionality that does not touch test.py way of executing the boost and unit tests. It supports the main features of test.py way of execution: automatic discovery of modes, repeats. There is an additional requirement to execute tests in parallel: pytest-xdist. To install it, execute `pip install pytest-xdist` To run test with pytest execute `pytest test/boost`. To execute only one file, provide the path filename `pytest test/boost/aggregate_fcts_test.cc` since it's a normal path, autocompletion will work on the terminal. To provide a specific mode, use the next parameter `--mode dev`; if the parameter is not provided, pytest will try to use `ninja mode_list` to find out the compiled modes. Parallel execution is controlled by pytest-xdist and the parameter `-n 12`. The useful command to discover the tests in the file or directory is `pytest --collect-only -q --mode dev test/boost/aggregate_fcts_test.cc`. That will return all test functions in the file. To execute only one function from the test, you can invoke the output from the previous command, but suffix for mode should be skipped, for example output will be `test/boost/aggregate_fcts_test.cc::test_aggregate_avg.dev`, so to execute this specific test function, please use the next command `pytest --mode dev test/boost/aggregate_fcts_test.cc::test_aggregate_avg` There is a parameter `--repeat` that is used to repeat the test case several times in the same way as test.py did. 
It's not possible to run both boost and unit tests directories with one command, so we need to provide explicitly which directory should be executed. Like this `pytest --mode dev test/unit` or `pytest --mode dev test/boost` Fixes: https://github.com/scylladb/qa-tasks/issues/1775 Closes scylladb/scylladb#21108 * github.com:scylladb/scylladb: test.py: Add possibility to run ldap tests from pytest test.py: Add the possibility to run unit tests from pytest test.py: Add the possibility to run boost test from pytest test.py: Add discovery for C++ tests for pytest test.py: Modify s3 server mock test.py: Add method to get environment variables from MinIO wrapper test.py: Move get configured modes to common lib	2025-02-09 11:56:24 +01:00
Yaron Kaikov	93f53f4eb8	dist: support smooth upgrade from enterprise to source available When upgrading for example from `2024.1` to `2025.1` the package name is not identical causing the upgrade command to fail: ``` Command: 'sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade scylla -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"' Exit code: 100 Stdout: Selecting previously unselected package scylla. Preparing to unpack .../6-scylla_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb ... Unpacking scylla (2025.1.0~dev-0.20250118.1ef2d9d07692-1) ... Errors were encountered while processing: /tmp/apt-dpkg-install-JbOMav/0-scylla-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb /tmp/apt-dpkg-install-JbOMav/1-scylla-python3_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb /tmp/apt-dpkg-install-JbOMav/2-scylla-server_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb /tmp/apt-dpkg-install-JbOMav/3-scylla-kernel-conf_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb /tmp/apt-dpkg-install-JbOMav/4-scylla-node-exporter_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb /tmp/apt-dpkg-install-JbOMav/5-scylla-cqlsh_2025.1.0~dev-0.20250118.1ef2d9d07692-1_amd64.deb Stderr: E: Sub-process /usr/bin/dpkg returned an error code (1) ``` Adding `Obsoletes` (for rpm) and `Replaces` (for deb) Fixes: https://github.com/scylladb/scylladb/issues/22420 Closes scylladb/scylladb#22457	2025-02-08 21:56:09 +02:00
Botond Dénes	be23ebf20f	Update tools/python3 submodule * tools/python3 8415caf4...3e0b8932 (2): > reloc: collect package files correctly if the package has an optional dependency > dist: support smooth upgrade from enterprise to source available Closes scylladb/scylladb#22517	2025-02-08 21:54:42 +02:00
Avi Kivity	9712390336	Merge 'Add per-table tablet options in schema' from Benny Halevy This series extends the table schema with per-table tablet options. The options are used as hints for initial tablet allocation on table creation and later for resize (split or merge) decisions, when the table size changes. * New feature, no backport required Closes scylladb/scylladb#22090 * github.com:scylladb/scylladb: tablets: resize_decision: get rid of initial_decision tablet_allocator: consider tablet options for resize decision tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member network_topology_strategy: allocate_tablets_for_new_table: consider tablet options network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0 cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0 cql3/create_keyspace_statement: add deprecation warning for initial tablets test: cqlpy: test_tablets: add tests for per-table tablet options schema: add per-table tablet options feature_service: add TABLET_OPTIONS cluster schema feature	2025-02-08 20:32:19 +02:00
Avi Kivity	9db9b0963f	Merge ' reader_concurrency_semaphore: set_notify_handler(): disable timeout ' from Botond Dénes `set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which is typical). Disable the timeout before setting the TTL to prevent premature eviction. Fixes: https://github.com/scylladb/scylladb/issues/22629 Backport required to all active releases, they are all affected. Closes scylladb/scylladb#22701 * github.com:scylladb/scylladb: reader_concurrency_semaphore: set_notify_handler(): disable timeout reader_permit: mark check_abort() as const	2025-02-08 20:05:03 +02:00
Kefu Chai	a6f703414a	db: switch from boost::adaptors::indirected to std::views replace boost::adaptors::indirected using std::views::transform for less header dependency. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22731	2025-02-08 17:36:46 +02:00
Avi Kivity	d3b8c9f5ef	build: update frozen toolchain to Fedora 41 with clang 19 Update from clang 18 to clang 19. perf-simple-query reports: clang 18 278102.35 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 36056 insns/op, 16560 cycles/op, 0 errors) 288801.19 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 36018 insns/op, 16004 cycles/op, 0 errors) 287795.23 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 36039 insns/op, 15995 cycles/op, 0 errors) 290495.86 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 36027 insns/op, 15939 cycles/op, 0 errors) 293116.10 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 36020 insns/op, 15780 cycles/op, 0 errors) clang 19 284742.08 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35517 insns/op, 16419 cycles/op, 0 errors) 297974.97 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35497 insns/op, 15926 cycles/op, 0 errors) 279527.99 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35513 insns/op, 16724 cycles/op, 0 errors) 298229.61 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35494 insns/op, 15892 cycles/op, 0 errors) 297982.67 tps ( 63.0 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35494 insns/op, 15819 cycles/op, 0 errors) So the update delivers a nice performance improvement. Optimized clang regenerated and stored in https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Script to prepare optimized clang updated, and upstreamed patch dropped. Closes scylladb/scylladb#22380	2025-02-08 17:18:17 +02:00
Andrei Chekun	043534acc6	test.py: Add possibility to run ldap tests from pytest Add possibility to run ldap tests with pytest. LDAP server will be created for each worker if xdist will be used. For one thread one LDAP server will be used for all tests.	2025-02-07 21:40:28 +01:00
Andrei Chekun	36ad813b94	test.py: Add the possibility to run unit tests from pytest Add the possibility to run unit tests from pytest	2025-02-07 21:40:28 +01:00
Andrei Chekun	8ef840a1c5	test.py: Add the possibility to run boost test from pytest Add the possibility to run boost test from pytest. Boost facade based on code from https://github.com/pytest-dev/pytest-cpp, but enhanced and rewritten to suit better.	2025-02-07 21:40:25 +01:00
Andrei Chekun	4addc039e5	test.py: Add discovery for C++ tests for pytest Code based on https://github.com/pytest-dev/pytest-cpp. Updated, customized, enhanced to suit current needs. Modify generate report to not modify the names, since it will break xdist way of working. Instead modification will be done in post collect but before executing the tests.	2025-02-07 19:44:06 +01:00
Andrei Chekun	fb4722443d	test.py: Modify s3 server mock Add the possibility to return environment as a dict to use it later in subprocess created by xdist, without starting another s3 mock server for each thread.	2025-02-07 19:38:53 +01:00
Andrei Chekun	7948c4561d	test.py: Add method to get environment variables from MinIO wrapper Add method to retrieve MinIO server wrapper environment variables for later processing. This change will allow sharing connection information with other processes and allow reusing the server across multiple tests.	2025-02-07 19:38:53 +01:00
Andrei Chekun	108ef5856f	test.py: Move get configured modes to common lib This will allow using this method inside the test module for pytest launching the boost and unit tests	2025-02-07 19:38:53 +01:00
Tomasz Grabiec	1854ea2165	test: tablets: Drop keyspace after do_test_load_balancing_merge_colocation() scenario This scenario is invoked in a loop in the test_load_balancing_merge_colocation_with_random_load test case, which will cause accumulation of tablet maps making each reload slower in subsequent iterations. It wasn't a problem before because we overwrote tablet_metadata in each iteration to contain only tablets for the current table, but now we need to keep it consistent with the schema and don't do that.	2025-02-07 17:13:52 +01:00
Tomasz Grabiec	58460a8863	tests: tablets: Set initial tablets to 1 to exit growing mode After tablet hints, there is no notion of leaving growing mode and tablet count is sustained continuously by initial tablet option, so we need to lower it for merge to happen.	2025-02-07 17:13:52 +01:00
Tomasz Grabiec	ca6159fbe2	test: tablets_test: Create proper schema in load balancer tests This is in preparation for load balancer changes needed to respect per-table tablet hints and respecting per-shard tablet count goal. After those changes, load balancer consults with the replication strategy in the database, so we need to create proper schema in the database. To do that, we need proper topology for replication strategies which use RF > 1, otherwise keyspace creation will fail.	2025-02-07 17:13:52 +01:00
Tomasz Grabiec	0d259bb175	test: lib: Introduce topology_builder Will be used by load balancer tests which need more than a single-node topology, and which want to create proper schema in the database which depends on that topology, in particular creating keyspaces with replication factor > 1. We need to do that because load balancer will use replication strategy from the database as part of plan making.	2025-02-07 16:48:33 +01:00
Tomasz Grabiec	3bb9d2fbdb	test: cql_test_env: Expose topology_state_machine	2025-02-07 16:09:21 +01:00
Tomasz Grabiec	61532eb53b	topology_state_machine: Introduce lock transition Will be used in load balancer tests to prevent concurrent topology operations, in particular background load balancing. load balancer will be invoked explicitly by the test. Disabling load balancer in topology is not a solution, because we want the explicit call to perform the load balancing.	2025-02-07 16:09:21 +01:00
Ernest Zaslavsky	5a266926e5	s3_client: Increase default part size for optimal performance Set the `upload_file` part size to 50MiB, as this value provides the best performance based on tests conducted using `perf_s3_client` on an i4i.4xlarge instance. ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 5 INFO 2025-02-06 10:34:08,007 [shard 0:main] perf - Uploaded 1024MB in 27.768863962s, speed 36.87583335786734MB/s ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 10 INFO 2025-02-06 10:35:07,161 [shard 0:main] perf - Uploaded 1024MB in 28.175412552s, speed 36.34374467845414MB/s ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 20 INFO 2025-02-06 10:35:55,530 [shard 0:main] perf - Uploaded 1024MB in 14.483539631s, speed 70.700949221575MB/s ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 30 INFO 2025-02-06 10:36:35,466 [shard 0:main] perf - Uploaded 1024MB in 11.486155799s, speed 89.15080188004683MB/s ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 40 INFO 2025-02-06 10:37:46,642 [shard 0:main] perf - Uploaded 1024MB in 10.236196424s, speed 100.03715809898961MB/s ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 50 INFO 2025-02-06 10:38:34,777 [shard 0:main] perf - Uploaded 1024MB in 9.490644522s, speed 107.895728011548MB/s ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 60 INFO 2025-02-06 10:39:08,832 [shard 0:main] perf - Uploaded 1024MB in 9.767783693s, speed 104.83442633295012MB/s ./perf_s3_client --smp 1 --upload --object_name ./1G-test-file --sockets 1 --part_size_mb 70 INFO 2025-02-06 10:39:47,916 [shard 0:main] perf - Uploaded 1024MB in 10.166116742s, speed 100.72675988162482MB/s Closes scylladb/scylladb#22732	2025-02-07 13:49:54 +03:00
Pavel Emelyanov	3cb0581022	Merge '.github: improve license header check workflow' from Kefu Chai This patch series contains improvements to our GitHub license header check workflow. The first patch grants necessary write permissions to the workflow, allowing it to comment directly on pull requests when license header issues are found. This addresses a permissions-related error that previously prevented the workflow from creating comments. The second patch optimizes the workflow by skipping the license check step when no relevant files have been modified in the pull request. This prevents unnecessary workflow failures that occurred when the check was run without any files to analyze. Together, these changes make the license header checking process more robust and efficient. The workflow now properly communicates findings through PR comments and avoids running unnecessary checks. --- no need to backport, as the workflow updated by this change only exists in master. Closes scylladb/scylladb#22736 * github.com:scylladb/scylladb: .github: grant write permissions for PR comments in license check workflow .github: skip license check when no relevant files changed	2025-02-07 13:47:53 +03:00
Alexey Novikov	cc35905531	Allow to use memtable_flush_period_in_ms schema option for system tables Only the 'memtable_flush_period_in_ms' option can be modified, and only on its own, not together with any other options Refs #20999 Fixes #21223 Closes scylladb/scylladb#22536	2025-02-07 10:33:05 +02:00
Kefu Chai	06b4abce56	.github: grant write permissions for PR comments in license check workflow Grant write permissions to the check-license-header workflow to enable commenting on pull requests. This fixes the "Resource not accessible by integration" HTTP error that occurred when the workflow attempted to create comments. The permission is required according to GitHub's API documentation for creating issue comments. see also https://docs.github.com/en/rest/issues/comments?apiVersion=2022-11-28#create-an-issue-comment Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-07 16:09:42 +08:00
Kefu Chai	342b640b4b	.github: skip license check when no relevant files changed Skip the license header check step in `check-license-header.yaml` workflow when no files with configured extensions were changed in the pull request. Previously, the workflow would fail in this case since the --files argument requires at least one file path: ``` check-license.py: error: argument --files: expected at least one argument ``` Add `if` condition to only run the check when steps.changed-files.outputs.files is not empty. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-07 16:09:42 +08:00
Yaron Kaikov	d50738feca	./github/workflows/pr-require-backport-label: fix regex to match source available version Until now, this action only checked for a `backport/none` or `backport/x.y` label. Since we moved to source available, releases like 2025.1 don't match this regex, so the action keeps failing Closes scylladb/scylladb#22734	2025-02-07 10:03:00 +02:00
Botond Dénes	9174f27cc8	reader_concurrency_semaphore: set_notify_handler(): disable timeout set_notify_handler() is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which is typical). Disable the timeout before setting the TTL to prevent premature eviction. Fixes: #scylladb/scylladb#22629	2025-02-07 02:31:01 -05:00
Botond Dénes	a3ae0c7cee	reader_permit: mark check_abort() as const All it does is read one field, making it const makes using it easier.	2025-02-07 01:32:35 -05:00
Ernest Zaslavsky	97d789043a	s3_client: Fix buffer offset reset on request retry This patch addresses an issue where the buffer offset becomes incorrect when a request is retried. The new request uses an offset that has already been advanced, causing misalignment. This fix ensures the buffer offset is correctly reset, preventing such errors. Closes scylladb/scylladb#22729	2025-02-07 08:52:08 +03:00
Pavel Emelyanov	f331d3b876	Merge 'auth: ensure default superuser password is set before serving CQL' from Andrzej Jackowski Before this change, it was ensured that a default superuser is created before serving CQL. However, the mechanism didn't wait for default password initialization, so effectively, for a short period, customers couldn't authenticate as the superuser properly. The purpose of this change is to improve the superuser initialization mechanism to wait for superuser default password, just as for the superuser creation. This change: - Introduce authenticator::ensure_superuser_is_created() to allow waiting for complete initialization of super user authentication - Implement ensure_superuser_is_created in password_authenticator, so waiting for superuser password initialization is possible - Implement ensure_superuser_is_created in transitional_authenticator, so the implementation from password_authenticator is used - Implement no-op ensure_superuser_is_created for other authenticators - Extend service::ensure_superuser_is_created to wait for superuser initialization in authenticator, just as it was implemented earlier for role_manager - Add injected error (sleep) in password_authenticator::start to reproduce a case of delayed password creation - Implement test_delayed_deafult_password to verify the correctness of the fix - Ensure superuser is created in single_node_cql_env::run_in_thread to make single_node_cql more similar to scylla_main in main.cc Fixes scylladb/scylladb#20566 Backport not needed - a minor bugfix Closes scylladb/scylladb#22532 * github.com:scylladb/scylladb: test: implement test_auth_password_ensured test: implement connect_driver argument in ManagerClient::server_add auth: ensure default superuser password is set before serving CQL auth: added password_authenticator_start_pause injected error	2025-02-07 08:47:01 +03:00
Michał Chojnowski	8fb2ea61ba	test_rpc_compression.py: test the dictionaries are loaded on startup Reproduces scylladb/scylladb#22738	2025-02-07 04:21:23 +01:00
Michał Chojnowski	dd82b40186	raft/group0_state_machine: load current RPC compression dict on startup We are supposed to be loading the most recent RPC compression dictionary on startup, but we forgot to port the relevant piece of logic during the source-available port.	2025-02-07 04:20:21 +01:00
Avi Kivity	861fb58e14	Merge 'vector: add support for vector type' from Dawid Pawlik This pull request is an implementation of vector data type similar to one used by Apache Cassandra. The patch contains: - implementation of vector_type_impl class - necessary functionalities similar to other data types - support for serialization and deserialization of vectors - support for Lua and JSON format - valid CQL syntax for `vector<>` type - `type_parser` support for vectors - expression adjustments such as: - add `collection_constructor::style_type::vector` - rename `collection_constructor::style_type::list` to `collection_constructor::style_type::list_or_vector` - vector type encoding (for drivers) - unit tests - cassandra compatibility tests - necessary documentation Co-authored-by: @janpiotrlakomy Fixes https://github.com/scylladb/scylladb/issues/19455 Closes scylladb/scylladb#22488 * github.com:scylladb/scylladb: docs: add vector type documentation cassandra_tests: translate tests covering the vector type type_codec: add vector type encoding boost/expr_test: add vector expression tests expression: adjust collection constructor list style expression: add vector style type test/boost: add vector type cql_env boost tests test/boost: add vector type_parser tests type_parser: support vector type cql3: add vector type syntax types: implement vector_type_impl	2025-02-06 20:36:50 +02:00
Benny Halevy	021fc3c756	tablets: resize_decision: get rid of initial_decision Now, with tablet_hints calculation of min_tablet_count it is not used anymore. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 18:43:47 +02:00
Benny Halevy	20c6ca2813	tablet_allocator: consider tablet options for resize decision Do not merge tablets if that would drop the tablet_count below the minimum provided by hints. Split tablets if the current tablet_count is less than the minimum tablet count calculated using the table's tablet options. TODO: override min_tablet_count if the tablet count per shard is greater than the maximum allowed. In this case the tables tablet counts should be scaled down proportionally. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 18:43:35 +02:00
Nadav Har'El	c2b870ee54	Merge 'De-duplicate validation of tables in some column_family API endpoints' from Pavel Emelyanov In column_family.cc and storage_service.cc there exist a bunch of helpers that parse and/or validate ks/cf names, and different endpoints use different combinations of those, duplicating the functionality of each other and generating some mess. This PR cleans the endpoints from column_family.cc that parse and validate fully qualified table name (the '$ks:$cf' string). A visible "improvement" is that `validate_table()` helper usage in the api/ directory is narrowed down to storage_service.cc file only (with the intent to remove that helper completely), and the aforementioned `for_tables_on_all_shards()` helper becomes shorter and tiny bit faster, because it doesn't perform some re-lookups of tables, that had been performed by validation sanity checks before it. There's more to be done in those helpers, this PR wraps only one part of this mess. Below is the list of endpoints this PR affects and the tests that validate the changes: \|endpoint\|test\| \|-\|-\| \|column_family/autocompaction\|rest_api/test_column_family::test_column_family_auto_compaction_table\| \|column_family/tombstone_gc\|rest_api/test_column_family::test_column_family_tombstone_gc_api\| \|column_family/compaction_strategy\|rest_api/test_column_family/test_column_family_compaction_strategy\| \|compaction_manager/stop_keyspace_compaction/\|rest_api/test_compaction_manager::{test_compaction_manager_stop_keyspace_compaction,test_compaction_manager_stop_keyspace_compaction_tables}\| Closes scylladb/scylladb#21533 * github.com:scylladb/scylladb: api: Hide parse_tables() helper api: Use parse_table_infos() in stop_keyspace_compaction handler api: Re-use parse_table_info() in column_family API api: Make get_uuid() return table_info (and rename) api: Remove keyspace argument from for_table_on_all_shards() api: Switch for_table_on_all_shards() to use table_info-s api: Hide validate_table() helper api: Tables vector is never empty now in for_table_on_all_shards() api: Move vectors of tables, not copy api: Add table validation to set_compaction_strategy_class endpoint api: Use get_uuid() to validate_table() in column family API api: Use parse_table_infos() in column family API	2025-02-06 17:28:08 +01:00
Avi Kivity	c33bbc884b	types: listlike_partially_deserializing_iterator: improve compatibility with std::ranges Range concepts require an iterator_concept tag and a default constructor, so provide those. Closes scylladb/scylladb#22138	2025-02-06 15:32:28 +03:00
Kefu Chai	5c7ad745fd	db: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. also, took this opportunity to remove an unused namespace alias. and add an include which is used actually. please note, `std::ranges::pop_heap()` and friends are actually provided by `<algorithm>` not `<ranges>`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22716	2025-02-06 13:38:19 +02:00
Andrzej Jackowski	d5a4f3d4cd	test: implement test_auth_password_ensured Before fix of scylladb#20566, CQL was served irrespectively of default superuser password creation, which led to an incorrect product behavior and sporadic test failures. This test verifies race condition of serving CQL and creating default superuser password. Injected failure is used to ensure CQL use is attempted before default superuser password creation, however, the attempt is expected to fail because scylladb#20566 is fixed. Following that, the injected error is notified, so CQL driver can be started correctly. Finally, CREATE USER query is executed to confirm successful superuser authentication. This change: - Implement test_auth_password_ensured.py The test starts a server without expecting CQL serving, because expected_server_up_state=ServerUpState.HOST_ID_QUERIED and connect_driver=False. Error password_authenticator_start_pause is injected to block superuser password setup during server startup. Next, the test waits for a log to confirm that the code implementing injected error is reached. When the server startup procedure is unfinished, some operations might not complete on a first try, so waiting for driver connection is wrapped in repeat_if_host_unavailable.	2025-02-06 10:30:55 +01:00
Andrzej Jackowski	e70ba7e3ed	test: implement connect_driver argument in ManagerClient::server_add This commit introduces connect_driver argument in ManagerClient::server_add. The argument allows skipping the CQL driver initialization part during server start. Starting a server without the driver is necessary to implement some test scenarios related to system initialization. After stopping a server, ManagerClient::server_start can be used to start the server again, so connect_driver argument is also added here to allow preventing connecting the driver after a server restart. This change: - Implement connect_driver argument in ManagerClient::server_add - Implement connect_driver argument in ManagerClient::server_start	2025-02-06 10:30:55 +01:00
Andrzej Jackowski	7391c9419f	auth: ensure default superuser password is set before serving CQL Before this change, it was ensured that a default superuser is created before serving CQL. However, the mechanism didn't wait for default password initialization, so effectively, for a short period, a customer couldn't authenticate as the superuser properly. The purpose of this change is to improve the superuser initialization mechanism to wait for superuser default password, just as for the superuser creation. This change: - Introduce authenticator::ensure_superuser_is_created() to allow waiting for complete initialization of super user authentication - Implement ensure_superuser_is_created in password_authenticator, so waiting for superuser password initialization is possible - Implement ensure_superuser_is_created in transitional_authenticator, so the implementation from password_authenticator is used - Implement no-op ensure_superuser_is_created for other authenticators - Modify service::ensure_superuser_is_created to wait for superuser initialization in authenticator, just as it was implemented earlier for role_manager Fixes scylladb/scylladb#20566	2025-02-06 10:30:55 +01:00
Andrzej Jackowski	7c63df085c	auth: added password_authenticator_start_pause injected error This change: - Implement password_authenticator_start_pause injected error to allow deterministic blocking of default superuser password creation This change facilitates manual testing of system behavior when default superuser password is being initialized. Moreover, this mechanism will be used in next commits to implement a test to verify a fix for erroneous CQL serving before default superuser password creation.	2025-02-06 10:30:45 +01:00
Kefu Chai	5443d9dabb	.github: add check-license-header workflow this workflow checks the first 10 lines for "LicenseRef-ScyllaDB-Source-Available-1.0" in newly introduced files when a new pull request is created against "master" or "next". if "LicenseRef-ScyllaDB-Source-Available-1.0" is not found, the workflow fails. for the sake of simplicity, instead of parsing the header for SPDX License ID, we just check to see if the "LicenseRef-ScyllaDB-Source-Available-1.0" is included. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22065	2025-02-06 12:20:23 +03:00
Nadav Har'El	cae8a7222e	alternator: fix view build on oversized GSI key attribute Before this patch, the regular_column_transformation constructor, which we used in Alternator GSIs to generate a view key from a regular-column cell, accepted a cell of any size. As a reviewer (Avi) noticed, very long cells are possible, well beyond what Scylla allows for keys (64KB), and because regular_column_transformation stores such values in a contiguous "bytes" object it can cause stalls. But allowing oversized attributes creates an even more acute problem: While view building (backfilling in DynamoDB jargon), if we encounter an oversized (>64KB) key, the view building step will fail and the entire view building will hang forever. This patch fixes both problems by adding to regular_column_transformation's constructor the check that if the cell is 64KB or larger, an empty value is returned for the key. This causes the backfilling to silently skip this item, which is what we expect to happen (backfilling cannot do anything to fix or reject the pre-existing items in the base table). A test test_gsi_updatetable.py::test_gsi_backfill_oversized_key is introduced to reproduce this problem and its fix. The test adds a 65KB attribute to a base table, and then adds GSIs to this table with this attribute as its partition key or its sort key. Before this patch, the backfilling process for the new GSIs hangs, and never completes. After this patch, the backfilling completes and as expected contains other base-table items but not the item with the oversized attribute. The new test also passes on DynamoDB. However, while implementing this fix I realized that issue #10347 also exists for GSIs. Issue #10347 is about the fact that DynamoDB limits partition key and sort key attributes to 2048 and 1024 bytes, respectively. In the fix described above we only handled the acute case of lengths above 64 KB, but we should actually skip items whose GSI keys are over 2048 or 1024 bytes - not 64KB. This extra checking is not handled in this patch, and is part of a wider existing issue: Refs #10347 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:50 +01:00
Nadav Har'El	7a0027bacc	mv: clean up do_delete_old_entry The function do_delete_old_entry() had an if() which was supposedly for the case of collection column indexing, and which our previous patch that improved this function to support caller-specified deletion_ts left behind. As a reviewer noticed, the new tombstone-setting code was in an "else" of that existing if(), and it wasn't clear what happens if we get to that else in the collection column indexing. So I reviewed the code and added breakpoints and realized that in fact, do_delete_old_entry() is never called for the collection-indexing case, which has its own update_entry_for_computed_column() which view_updates::generate_update() calls instead of the do_delete_old_entry() function and its friends. So it appears that do_delete_old_entry() doesn't need that special case at all, which simplifies it. We should eventually simplify this code further. In particular, the function generate_update() already knows the key of the rows it adds or deletes so do_delete_old_entry() and its friends don't need to call get_view_rows() to get it again. But these simplifications and others will need to come in a later patch series, this one is already long enough :-) Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:49 +01:00
Nadav Har'El	67d2ea4c4b	test/alternator: unflake test for IndexStatus The test for IndexStatus verifies that on a newly created table and GSI, the IndexStatus is "ACTIVE". However, in Alternator, this doesn't strictly need to happen immediately - view building, even for an empty table - can take a short while in debug mode. This makes the test test_gsi_describe_indexstatus flaky in debug mode. The fix is to wait for the GSI to become active with wait_for_gsi() before checking it is active. This is sort of silly and redundant, but the important point is that if the IndexStatus is incorrect this test will fail, it doesn't really matter whether the wait_for_gsi() or the DescribeTable assertion is what fails. Now that wait_for_gsi() is used in two test files, this patch moves it (and its friend, wait_for_gsi_gone()) to util.py. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:49 +01:00
Nadav Har'El	4ba17387e6	test/alternator: work around unrelated bug causing test flakiness The alternator test test_gsi_updatetable.py::test_gsi_delete_with_lsi Creates a GSI together with a table, and then deletes it. We have a bug unrelated to the purpose of this test - #9059 - that causes view building to sometimes crash Scylla if the view is deleted while the view build is starting. We see specifically in debug builds that even view building of an empty table might not finish before the test deletes the view - so this bug happens. Work around that bug by waiting for the GSI to build after creating the table with the GSI. This shouldn't be necessary (in DynamoDB, a GSI created with the table always begins ready with the table), but doesn't hurt either. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:49 +01:00
Nadav Har'El	42eabb3b6f	docs/alternator: adding a GSI is no longer an unimplemented feature The previous patches implemented issue #11567 - adding a GSI to a pre-existing table. So we can finally remove the mention of this feature as an "unimplemented feature" in docs/alternator/compatibility.md. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:49 +01:00
Nadav Har'El	ac648950f1	test/alternator: remove xfail from all tests for issue 11567 The previous patches fully implemented issue 11567 - supporting UpdateTable to add or delete a GSI on an existing Alternator table. All 14 tests that were marked xfail because of this issue now pass, so this patch removes their xfail. There are no more xfailing tests referring to this issue. These 14 tests, most of them in test/alternator/test_gsi_updatetable.py, cover all aspects of this feature, including adding a GSI, deleting a GSI, interactions between GSI and LSI, RBAC when adding or deleting a GSI, data type limitation on an attribute that becomes a GSI key or stops being one, GSI backfill, DescribeTable and backfill, various error conditions, and more. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:49 +01:00
Nadav Har'El	9bfa6bf267	alternator: overhaul implementation of GSIs and support UpdateTable The main goal of this patch is to fully support UpdateTable's ability to add a GSI to an existing table, and delete a GSI from an existing table. But to achieve this, this patch first needs to overhaul how GSIs are implemented: Remember that in Alternator's data model, key attributes in a table are stored as real CQL columns (with a known type), but all other attributes of an item are stored in one map called ":attrs". * Before this patch, the GSI's key columns were made into real columns in the table's schema, and the materialized view used that column as the view's key. * After this patch, the GSI's key columns usually (when they are not the base table's keys, and not any LSI's key) are left in the ":attrs" map, just like any other non-key column. We use a new type of computed column (added in the previous patch) to extract the desired element from this map. This overhaul of the GSI implementation doesn't change anything in the functionality of GSIs (and the Alternator test suite tries very hard to ensure that), but finally allows us to add a GSI to an already-existing table. This is now possible because the GSI will be able to pick up existing data from inside the ":attrs" map where it is stored, instead of requiring the data in the map to be moved to a stand-alone column as the previous implementation needed. So this patch also finally implements the UpdateTable operations (Create and Delete) to add or delete a GSI on an existing table, as this is now fairly straightforward. For the process of "backfilling" the existing data into the new GSI we don't need to do anything - this is just the materialized-view "view building" process that already exists. Fixes #11567. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:49 +01:00
Nadav Har'El	bc7b5926d2	mv: support regular_column_transformation key columns in view In an earlier patch, we introduced regular_column_transformation, a new type of computed column that does a computation on a cell in regular column in the base and returns a potentially transformed cell (value or deletion, timestamp and ttl). In this patch, we wire the materialized view code to support this new kind of computed column that is usable as a materialized-view key column. This new type of computed column is not yet used in this patch - this will come in the next patch, where we will use it for Alternator GSIs. Before this patch, the logic of deciding when the view update needs to create a new row or delete a new one, and which timestamp and ttl to give to the new row, could depend on one (or two - in Alternator) cells read from base-table regular columns. In this patch, this logic is rewritten - the notion of "base table regular columns" is generalized to the notion of "updatable view key columns" - these are view key columns that an update may change - because they really are base regular columns, or a computed function of one (regular_column_transformation). In some sense, the new code is easier to understand - there is no longer a separate "compute_row_marker()" function, rather the top-level generate_update() is now in charge of finding the "updatable view key columns" and calculate the row marker (timestamp and ttl) as part of deciding what needs to be done. But unfortunately the code still has separate code paths for "collection secondary indexing", and also for old-style column_computation (basically, only token_column_computation). Perhaps in the future this can be further simplified. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:49 +01:00
Nadav Har'El	ea87b9fff0	alternator: add new materialized-view computed column for item in map This patch adds a new computed column class for materialized views, extract_from_attrs_column_computation which is Alternator-specific and knows how to extract a value (of a known type) from an attribute stored in Alternator's map-of-all-nonkey-attributes ":attrs". We'll use this new computed column in the next patch to reimplement GSI. The new computed-column class is based on regular_column_transformation introduced in the previous patch. It is not yet wired to anything: The MV code cannot handle any regular_column_transformation yet, and Alternator will not yet use it to create a GSI. We'll do those things in the following patches. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:48 +01:00
Nadav Har'El	e8d1e8a515	build: in cmake build, schema needs alternator This patch is to cmake what the previous patch was to configure.py. In the next patch we want to make schema/schema.o depend on alternator/executor.o - because when the schema has an Alternator computed column, the schema code needs to construct the computed column object (extract_from_attrs_column_computation) and that lives in alternator/executor.o. In the cmake-based build, all the schema/* objects are put into one library "libschema.a". But code that uses this library (e.g., tests) can't just use that library alone, because it depends on other code not in schema/. So CMakeLists.txt lists other "libraries" that libschema.a depends on - including for example "cql3". We now need to add "alternator" to this dependency list. The dependency is marked "PRIVATE" - schema needs alternator for its own internal uses, but doesn't need to export alternator's APIs to its own users. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:48 +01:00
Nadav Har'El	1ebdf1a9f7	build: build tests with Alternator For an unknown (to me) reason, configure.py has two separate source file lists - "scylla_core" and "alternator". Scylla, and almost all tests, are compiled with both lists, but just a couple of tests were compiled with just scylla_core without alternator. In the next patch we want to make schema/schema.o depend on alternator/executor.o because when the schema has an Alternator computed column, the schema code needs to construct the computed column object (extract_from_attrs_column_computation) and that lives in alternator/executor.o. This change will break the build of the two tests that do not include the Alternator objects. So let's just add the "alternator" dependencies to the couple of tests that were missing it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:48 +01:00
Nadav Har'El	828cc98e4c	alternator: add function serialized_value_if_type() This patch introduces a function serialized_value_if_type() which takes a serialized value stored in the ":attrs" map, and converts it into a serialized CQL type if it matches a particular type (S, B or N) - or returns null if the value has the wrong type. We will use this function in the following patch for deserializing values stored in the ":attrs" map to use them as a materialized view key. If the value has the right type, it will be converted to the CQL type and used as the key - but if it has the wrong type the key will be null and it will not appear in the view. This is exactly how GSI is supposed to behave. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:48 +01:00
Nadav Har'El	c8ea9f8470	mv: introduce regular_column_transformation, a new type of computed column In the patches that follow, we want Alternator to be able to use as a key for a materialized view (GSI) not a real column from the schema, but rather an attribute value deserialized from a member of the ":attrs" map. For this, we need the ability for materialized view to define a key column which is computed as a function of a real column (":attrs"). We already have an MV feature which we called "computed column" (column_computation), but it is wholly inadequate for this job: column_computation can only take a partition key, and produce a value - while we need it to take a regular column (one member of ":attrs"), not just the partition key, and return a cell - value or deletion, timestamp and TTL. So in this patch we introduce a new type of computed column, which we called "regular_column_transformation" since it intends to perform some sort of transformation on a single column (or more accurately, a single atomic cell). The limitation that this function transforms a single column only is important - if we had a function of multiple columns, we wouldn't know which timestamp or ttl it should use for the result if the two columns had different timestamps or TTLs. The new class isn't wired to anything yet: The MV code cannot handle it yet, and the Alternator code will not use it yet. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:48 +01:00
Nadav Har'El	cea7aacc52	alternator: add IndexStatus/Backfilling in DescribeTable This patch adds the missing IndexStatus and Backfilling fields for the GSIs listed by a DescribeTable request. These fields allow an application to check whether a GSI has been fully built (IndexStatus=ACTIVE) or currently being built (IndexStatus=CREATING, Backfilling=true). This feature is necessary when a GSI can be added to an existing table so its backfilling might take time - and the application might want to wait for it. One test - test_gsi.py::test_gsi_describe_indexstatus - begins to pass with this fix, so the xfail tag is removed from it. Fixes #11471. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:48 +01:00
Nadav Har'El	6239e92776	alternator: add "LimitExceededException" error type This patch adds to Alternator's api_error type yet another type of error, api_error::limit_exceeded (error code "LimitExceededException"). DynamoDB returns this error code in certain situations where certain low limits were exceeded, such as the case we'll need in a following patch - an UpdateTable that tries to create more than one GSI at once. The LimitExceededException error type should not be confused with other similarly-named but different error messages like ProvisionedThroughputExceededException or RequestLimitExceeded. In general, we make an attempt to return the same error code that DynamoDB returns for a given error. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:47 +01:00
Nadav Har'El	279fe43ebe	docs/alternator: document two more unimplemented Alternator features Two new features were added to DynamoDB this month - MultiRegionConsistency and WarmThroughput. Document them as unimplemented - and link to the relevant issue in our bug tracker - in docs/alternator/compatibility.md. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-02-06 09:59:47 +01:00
Pavel Emelyanov	64baab1b95	Merge 'config: prevent SIGHUP from changing non-liveupdatable parameters' from Andrzej Jackowski Before this change, it was possible to change non-liveupdatable config parameter without process restart. This erroneous behavior not only contradicts the documentation but is potentially dangerous, as various components theoretically might not be prepared for a change of configuration parameter value without a restart. The issue came from a fact that liveupdatability verification check was skipped for default configuration parameters (those without its initial values in configuration file during process start). This change: - Introduce _initialization_completed member in config_file - Set _initialization_completed=true when config file is processed on server start - Verify config_file's initialization status during config update - if config_file was initialized, prevent from further changes of non-liveupdatable parameters - Implement ScyllaRESTAPIClient::get_config() that obtains a current value of given configuration parameter via /v2/config REST API - Implement test to confirm that only liveupdatable parameters are changed when SIGHUP is sent after configuration file change Function set_initialization_completed() is called only once in main.cc, and the effect is expected to be visible in all shards, as a side effect of cfg->broadcast_to_all_shards() that is called shortly after. The same technique was already used for enable_3_1_0_compatibility_mode() call. Fixes scylladb/scylladb#5382 No backport - minor fix. Closes scylladb/scylladb#22655 * github.com:scylladb/scylladb: test: SIGHUP doesn't change non-liveupdatable configuration test: implement ScyllaRESTAPIClient::get_config() config: prevent SIGHUP from changing non-liveupdatable parameters config: remove unused set_value_on_all_shards(const YAML::Node&)	2025-02-06 11:33:59 +03:00
Pavel Emelyanov	951625ca13	Merge 's3 client: add aws credentials providers' from Ernest Zaslavsky This update introduces four types of credential providers: 1. Environment variables 2. Configuration file 3. AWS STS 4. EC2 Metadata service The first two providers should only be used for testing and local runs. They must NEVER be used in production. The last two providers are intended for use on real EC2 instances: - AWS STS: Preferred method for obtaining temporary credentials using IAM roles. - EC2 Metadata Service: Should be used as a last resort. Additionally, a simple credentials provider chain is created. It queries each provider sequentially until valid credentials are obtained. If all providers fail, it returns an empty result. fixes: #21828 Closes scylladb/scylladb#21830 * github.com:scylladb/scylladb: docs: update the `object_storage.md` and `admin.rst` aws creds: add STS and Instance Metadata service credentials providers aws creds: add env. and file credentials providers s3 creds: move credentials out of endpoint config	2025-02-06 11:12:37 +03:00
Benny Halevy	559f083dc6	tablet_allocator: load_balancer: table_size_desc: keep target_tablet_size as member Rather than target_max_tablet_size. We need both the target as well as max and min tablet sizes, so there is no sense in keeping the max and deriving the target and the minimum for the max value. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:32 +02:00
Benny Halevy	32c2f7579f	network_topology_strategy: allocate_tablets_for_new_table: consider tablet options Use the keyspace initial_tablets for min_tablet_count, if the latter isn't set, then take the maximum of the option-based tablet counts: - min_tablet_count - and expected_data_size_in_gb / target_tablet_size - min_per_shard_tablet_count (via calculate_initial_tablets_from_topology) If none of the hints produce a positive tablet_count, fall back to calculate_initial_tablets_from_topology * initial_scale. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:32 +02:00
Benny Halevy	86bcf4cffe	network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner Current implementation is inefficient as it calls get_datacenter_token_owners_ips and then find_node(ep) while for_each_node easily provides a host_id for is_normal_token_owner. Then, since we're interested only in datacenters configured with a replication factor (but it still might be 0), simply iterate over the dc->rf map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:30 +02:00
Benny Halevy	49dacb1d52	network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0 Currently, if a datacenter has no replication_factor option we consider its replication factor to be 1 in calculate_initial_tablets_from_topology, but since we're not going to have any replica on it, it should be 0. This is very minor since in the worst case, it will pessimize the calculation and calculate a value for initial_tablets that's higher than it could be. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Benny Halevy	8aace28397	cql3: data_dictionary: format keyspace_metadata: print "enabled":true when initial_tablets=0 Keyspace `initial` tablets option is deprecated and may be removed in the future. Rather than relying on `initial`:0 to always enable tablets, explicitly print "enabled":true when tablets are enabled and initial_tablets=0, same as keyspace_metadata::describe. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Benny Halevy	1054e05491	cql3/create_keyspace_statement: add deprecation warning for initial tablets Per-table hints should be used instead. Note: the warning is produced by check_against_restricted_replication_strategies which is called also from alter_keyspace_statement. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Benny Halevy	7cd29810a0	test: cqlpy: test_tablets: add tests for per-table tablet options Test specifying of per-table tablet options on table creation and alter table. Also, add a negative test for attempting to use tablet options with vnodes (that should fail). And add a basic test for testing tablet options also with materialized views. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Benny Halevy	c5668d99c9	schema: add per-table tablet options Unlike with vnodes, each tablet is served only by a single shard, and it is associated with a memtable that, when flushed, creates sstables whose token range is confined to the tablet owning them. On one hand, this allows for far better agility and elasticity since migration of tablets between nodes or shards does not require rewriting most if not all of the sstables, as required with vnodes (at the cleanup phase). Having too few tablets might limit performance due to not being served by all shards or by imbalance between shards caused by quantization. The number of tablets per table has to be a power of 2 with the current design, and when divided by the number of shards, some shards will serve N tablets, while others may serve N+1, and when N is small (N+1)/N may be significantly larger than 1. For example, with N=1, some shards will serve 2 tablet replicas and some will serve only 1, causing an imbalance of 100%. Now, simply allocating a lot more tablets for each table may theoretically address this problem, but practically: a. Each tablet has memory overhead and having too many tablets in the system with many tables and many tablets for each of them may overwhelm the system and cause out-of-memory errors. b. Too-small tablets cause a proliferation of small sstables that are less efficient to access, have higher metadata overhead (due to per-sstable overhead), and might exhaust the system's open file-descriptor limits. The options introduced in this change can help the user tune the system in two ways: 1. Sizing the table to prevent unnecessary tablet splits and migrations. This can be done when the table is created, or later on, using ALTER TABLE. 2. Controlling min_per_shard_tablet_count to improve tablet balancing, for hot tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Benny Halevy	ad8b0649ff	feature_service: add TABLET_OPTIONS cluster schema feature To be used for enabling per-table tablet options. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Tomasz Grabiec	3bb19e9ac9	locator: network_topology_strategy: Ignore leaving nodes when computing capacity for new tables For example, nodes which are being decommissioned should not be considered as available capacity for new tables. We don't allocate tablets on such nodes, so counting them would result in a higher per-shard load than planned. Closes scylladb/scylladb#22657	2025-02-05 23:59:41 +02:00
Kefu Chai	9a20fb43ab	tree: replace boost::min_element() with std::ranges::min_element() in order to reduce the external header dependency, let's switch to the standardized std::ranges::min_element(). Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22572	2025-02-05 21:54:01 +02:00
Botond Dénes	3d12451d1f	db/config: reader_concurrency_semaphore_cpu_concurrency: bump default to 2 This config item controls how many CPU-bound reads are allowed to run in parallel. The effective concurrency of a single CPU core is 1, so allowing more than one CPU-bound reads to run concurrently will just result in time-sharing and both reads having higher latency. However, restricting concurrency to 1 means that a CPU bound read that takes a lot of time to complete can block other quick reads while it is running. Increase this default setting to 2 as a compromise between not over-using time-sharing, while not allowing such slow reads to block the queue behind them. Fixes: #22450 Closes scylladb/scylladb#22679	2025-02-05 21:52:20 +02:00
Tomasz Grabiec	e22e3b21b1	locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes In that case, new_racks will be used, but when we discover no candidates, we try to pop from existing_racks. Fixes #22625 Closes scylladb/scylladb#22652	2025-02-05 20:13:05 +02:00
Nadav Har'El	bfdd805f15	test/alternator: fix running against installation blocking CQL One of the design goals of the Alternator test suite (test/alternator) is that developers should be able to run the tests against some already running installation by running `cd test/alternator; pytest [--url ...]`. Some of our presentations and documents recommend running Alternator via docker as: docker run --name scylla -d -p 8000:8000 scylladb/scylla:latest --alternator-port=8000 --alternator-write-isolation=always This only makes port 8000 available to the host - the CQL port is blocked. We had a bug in conftest.py's get_valid_alternator_role() which caused it to fail (and fail every single test) when CQL is not available. What we really want is that when CQL is not available and we can't figure out a correct secret key to connect to Alternator, we just try a connect with a fake key - and hope that the option alternator-enforce-authorization is turned off. In fact, this is what the code comments claim was already happening - but we failed to handle the case that CQL is not available at all. After this patch, one can run Alternator with the above docker command, and then run tests against it. By the way, this provides another way for running any old release of Scylla and running Alternator tests against it. We already supported a similar feature via test/alternator/run's "--release" option, but its implementation doesn't use docker. Fixes #22591 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22592	2025-02-05 19:01:31 +03:00
Botond Dénes	7ce932ce01	service: query_pager: fix last-position for filtering queries On short-pages, cut short because of a tombstone prefix. When page-results are filtered and the filter drops some rows, the last-position is taken from the page visitor, which does the filtering. This means that last partition and row position will be that of the last row the filter saw. This will not match the last position of the replica, when the replica cut the page due to tombstones. When fetching the next page, this means that all the tombstone suffix of the last page, will be re-fetched. Worse still: the last position of the next page will not match that of the saved reader left on the replica, so the saved reader will be dropped and a new one created from scratch. This wasted work will show up as elevated tail latencies. Fix by always taking the last position from raw query results. Fixes: #22620 Closes scylladb/scylladb#22622	2025-02-05 17:23:30 +02:00
Avi Kivity	f3751f0eba	tools: toolchain: dbuild: don't use `which` command The `which` command is typically not installed on cloud OS images and so requires the user to remember to install it (or to be prompted by a failure to install it). Replace it with the built-in `type` that is always there. Wrap it in a function to make it clear what it does. Closes scylladb/scylladb#22594	2025-02-05 17:18:05 +03:00
Avi Kivity	1ef0a48bbe	conf: scylla.yaml: add stubs for encryption at rest These are helpful for configuring encryption-at-rest. Copied verbatim from scylla-enterprise. Closes scylladb/scylladb#22653	2025-02-05 17:14:53 +03:00
Raphael S. Carvalho	ce65164315	test: Use linux-aio backend again on seastar-based tests Since mid December, tests started failing with ENOMEM while submitting I/O requests. Logs of failed tests show IO uring was used as backend, but we never deliberately switched to IO uring. Investigation pointed to it happening accidentally in commit `1bac6b75dc`, which turned on IO uring for allowing native tool in production, and picked linux-aio backend explicitly when initializing Scylla. But it missed that seastar-based tests would pick the default backend, which is io_uring once enabled. There's a reason we never made io_uring the default, which is that it's not stable enough, and it turns out we made the right choice back then: it apparently continues to be unstable, causing flakiness in the tests. Let's undo that accidental change in tests by explicitly picking the linux-aio backend for seastar-based tests. This should hopefully bring back stability. Refs #21968. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22695	2025-02-05 15:19:24 +02:00
Ernest Zaslavsky	29e60288de	docs: update the `object_storage.md` and `admin.rst` Added additional options and best practices for AWS authentication.	2025-02-05 14:57:19 +02:00
Ernest Zaslavsky	dee4fc7150	aws creds: add STS and Instance Metadata service credentials providers This commit introduces two new credentials providers: STS and Instance Metadata Service. The S3 client's provider chain has been updated to incorporate these new providers. Additionally, unit tests have been added to ensure coverage of the new functionality.	2025-02-05 14:57:19 +02:00
Ernest Zaslavsky	d534051bea	aws creds: add env. and file credentials providers This commit entirely removes credentials from the endpoint configuration. It also eliminates all instances of manually retrieving environment credentials. Instead, the construction of file and environment credentials has been moved to their respective providers. Additionally, a new aws_credentials_provider_chain class has been introduced to support chaining of multiple credential providers.	2025-02-05 14:57:19 +02:00
Kefu Chai	f7a729c3fd	github: use clang-21 in clang-nightly workflow since clang 20 has been branched. let's track the development branch, which is clang 21. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22698	2025-02-05 14:58:35 +03:00
Aleksandra Martyniuk	fe02555c46	tasks: drop task_manager::config::broadcast_address as it's unused	2025-02-05 10:11:54 +01:00
Aleksandra Martyniuk	e16b413568	tasks: replace ip with host_id in task_identity Replace ip with host_id in task_identity. Translate host_id to ip in task manager api handlers. Use host_id in send_tasks_get_children.	2025-02-05 10:11:52 +01:00
Aleksandra Martyniuk	0c868870b4	api: task_manager: pass gossiper to api::set_task_manager Pass gossiper to api::set_task_manager. It will be used later for host_id to ip transition.	2025-02-05 10:10:29 +01:00
Aleksandra Martyniuk	4470c2f6d3	tasks: keep host_id in task_manager Keep host_id of a node in task manager. If host_id wasn't resolved yet, task manager will keep an empty id. It's a preparation for the following changes.	2025-02-05 10:10:29 +01:00
Aleksandra Martyniuk	7969e98b4e	tasks: move tasks_get_children to IDL	2025-02-05 10:10:29 +01:00
Kefu Chai	3aeecd4264	generic_server: correct typo in comment s/invokation/invocation/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22697	2025-02-05 11:48:50 +03:00
Andrzej Jackowski	6f5ba3dd89	test: SIGHUP doesn't change non-liveupdatable configuration This change: - Implement test to confirm that only liveupdatable parameters are changed when SIGHUP is sent after configuration file change	2025-02-05 09:37:37 +01:00
Andrzej Jackowski	a001b20938	test: implement ScyllaRESTAPIClient::get_config() This change: - Implement ScyllaRESTAPIClient::get_config() that obtains a current value of given configuration parameter via /v2/config REST API	2025-02-05 09:37:37 +01:00
Andrzej Jackowski	dd899c0f1f	config: prevent SIGHUP from changing non-liveupdatable parameters Before this change, it was possible to change non-liveupdatable config parameter without process restart. This erroneous behavior not only contradicts the documentation but is potentially dangerous, as various components theoretically might not be prepared for a change of configuration parameter value without a restart. The issue came from a fact that liveupdatability verification check was skipped for default configuration parameters (those without its initial values in configuration file during process start). This change: - Introduce _initialization_completed member in config_file - Set _initialization_completed=true when config file is processed on server start - Verify config_file's initialization status during config update - if config_file was initialized, prevent from further changes of non-liveupdatable parameters Fixes scylladb/scylladb#5382	2025-02-05 09:37:30 +01:00
Pavel Emelyanov	83f3821f99	Merge 'cql: clean the code validating replication strategy options' from Piotr Smaron Clean the code validating if a replication strategy can be used. This PR consists of a bunch of unmerged https://github.com/scylladb/scylladb/pull/20088 commits - the solution to the problem that the linked PR tried to solve has been accomplished in another PR, leaving the refactor commits unmerged. The commits introduced in this PR have already been reviewed in the old PR. No need to backport, it's just a refactor. Closes scylladb/scylladb#22516 * github.com:scylladb/scylladb: cql: restore validating replication strategies options cql: change validating NetworkTopologyStrategy tags to internal_error cql: inline abstract_replication_strategy::validate_replication_strategy cql: clean redundant code validating replication strategy options	2025-02-05 11:18:50 +03:00
Jenkins Promoter	9add2ccc41	Update pgo profiles - x86_64	2025-02-05 08:44:54 +02:00
Jenkins Promoter	c7660e5962	Update pgo profiles - aarch64	2025-02-05 07:51:46 +02:00
Ferenc Szili	a59618e83d	truncate: create session during request handling Currently, the session ID under which the truncate for tablets request is running is created during the request creation and queuing. This is a problem because this could overwrite the session ID of any ongoing operation on system.topology#session This change moves the creation of the session ID for truncate from the request creation to the request handling. Fixes #22613 Closes scylladb/scylladb#22615	2025-02-04 22:11:24 +01:00
Botond Dénes	f2d5819645	reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload with_permit() creates a permit, with a self-reference, to avoid attaching a continuation to the permit's run function. This self-reference is used to keep the permit alive, until the execution loop processes it. This self reference has to be carefully cleared on error-paths, otherwise the permit will become a zombie, effectively leaking memory. Instead of trying to handle all loose ends, get rid of this self-reference altogether: ask caller to provide a place to save the permit, where it will survive until the end of the call. This makes the call-site a little bit less nice, but it gets rid of a whole class of possible bugs. Fixes: #22588 Closes scylladb/scylladb#22624	2025-02-04 21:27:16 +02:00
Ernest Zaslavsky	c911fc4f34	s3 creds: move credentials out of endpoint config This commit refactors the way AWS credentials are managed in Scylla. Previously, credentials were included in the endpoint configuration. However, since credentials and endpoint configurations serve different purposes and may have different lifetimes, it’s more logical to manage them separately. Moving forward, credentials will be completely removed from the endpoint_config to ensure clear separation of concerns.	2025-02-04 16:45:23 +02:00
Andrzej Jackowski	fb118bfd3b	config: remove unused set_value_on_all_shards(const YAML::Node&) This change: - Remove unused set_value_on_all_shards(const YAML::Node&) member function in class config_file::named_value The function logic was flawed, in a similar way named_value<T>::set_value(const YAML::Node& node) is flawed: the config source verification is insufficient for liveupdatable parameters, allowing overwriting of non-liveupdatable config parameters (refer to scylladb#5382). As the function was not used, it was removed instead of fixing.	2025-02-04 15:09:23 +01:00
Michał Chojnowski	bea434f417	pgo: disable tablets for training with secondary index, lwt and counters As of right now, materialized views (and consequently secondary indexes), lwt and counters are unsupported or experimental with tablets. Since by defaults tablets are enabled, training cases using those features are currently broken. The right thing to do here is to disable tablets in those cases. Fixes https://github.com/scylladb/scylladb/issues/22638 Closes scylladb/scylladb#22661	2025-02-04 15:38:53 +02:00
Piotr Smaron	2953d3ebe0	cql: restore validating replication strategies options `validate_options` needs to be extended with `topology` parameter, because NetworkTopologyStrategy needs to validate if every explicitly listed DC is really existing. I did cut a corner a bit and trimmed the message thrown when it's not the case, just to avoid passing an extra parameter (ks name) to the `validate_options` function, as I find the longer message to be a bit redundant (the driver will receive info which KS modification failed). The tests that have been commented out in the previous commit have been restored.	2025-02-04 12:27:33 +01:00
Piotr Smaron	100e8d2856	cql: change validating NetworkTopologyStrategy tags to internal_error The check for `replication_factor` tag in `network_topology_strategy::validate_options` is redundant for 2 reasons: - before we reach this part of the code, the `replication_factor` tag is replaced with specific DC names - we actually do allow for `replication_factor` tag in NetworkTopologyStrategy for keyspaces that have tablets disabled. This code is unreachable, hence changing it to an internal error, which means this situation should never occur. The place that unrolls `replication_factor` tag checked for presence of this tag ignoring the case, which led to an unexpected behaviour: - `replication_factor` tag (note the lowercase) was unrolled, as explained above, - the same tag but written in any other case resulted in throwing a vague message: "replication_factor is an option for SimpleStrategy, not NetworkTopologyStrategy". So we're changing this validation to accept and unroll only the lowercase version of this tag. We can't ignore the case here, as this tag is present inside a json, and json is case-sensitive, even though the CQL itself is case insensitive. Added a test that passes for both scylla and cassandra. Fixes: #15336	2025-02-04 12:27:29 +01:00
Aleksandra Martyniuk	683176d3db	tasks: add shard, start_time, and end_time to task_stats task_stats contains short info about a task. To get a list of task_stats in the module, one needs to request /task_manager/list_module_tasks/{module}. To make identification and navigation between tasks easier, extend task_stats to contain shard, start_time, and end_time. Closes scylladb/scylladb#22351	2025-02-04 12:11:24 +02:00
Botond Dénes	8c8db2052e	Merge 'service: add child for tablet repair virtual task' from Aleksandra Martyniuk tablet_repair_task_impl is run as a part of tablet repair. Make it a child of tablet repair virtual task. tablet_repair_task_impl started by /storage_service/repair_async API (vnode repair) does not have a parent, as it is the top-level task in that case. No backport needed; new functionality Closes scylladb/scylladb#22372 * github.com:scylladb/scylladb: test: add test to check tablet repair child service: add child for tablet repair virtual task	2025-02-04 12:08:24 +02:00
Aleksandra Martyniuk	610a761ca2	service: use read barrier in tablet_virtual_task::contains Currently, when the tablet repair is started, info regarding the operation is kept in the system.tablets. The new tablet states are reflected in memory after load_topology_state is called. Before that, the data in the table and the memory aren't consistent. To check the supported operations, tablet_virtual_task uses in-memory tablet_metadata. Hence, it may not see the operation, even though its info is already kept in system.tablets table. Run read barrier in tablet_virtual_task::contains to ensure it will see the latest data. Add a test to check it. Fixes: #21975. Closes scylladb/scylladb#21995	2025-02-04 12:07:42 +02:00
Avi Kivity	6913f054e7	Update tools/cqlsh submodule The driver update makes cqlsh work well with Python 3.13. * tools/cqlsh 52c6130...02ec7c5 (18): > chore(deps): update dependency scylla-driver to v3.28.2 > dist: support smooth upgrade from enterprise to source available > github action: fix downloading of artifacts > chore(deps): update docker/setup-buildx-action action to v3 > chore(deps): update docker/login-action action to v3 > chore(deps): update docker/build-push-action action to v6 > chore(deps): update docker/setup-qemu-action action to v3 > chore(deps): update peter-evans/dockerhub-description action to v4 > upload actions: update the usage for multiple artifacts > chore(deps): update actions/download-artifact action to v4.1.8 > chore(deps): update dependency scylla-driver to v3.28.0 > chore(deps): update pypa/cibuildwheel action to v2.22.0 > chore(deps): update actions/checkout action to v4 > chore(deps): update python docker tag to v3.13 > chore(deps): update actions/upload-artifact action to v4 > github actions: update it to work > add option to output driver debug > Add renovate.json (#107) Closes scylladb/scylladb#22593	2025-02-04 12:06:54 +02:00
Avi Kivity	f25636884a	api: storage_service: break out set_storage_service lambdas into free functions This was originally an attempt to reduce the compile time of this translation unit, but apparently it doesn't work. Still, it has the effect of converting stack traces that say "set_storage_service" and refer to some lambda to stack traces that refer to the operation being performed, so it's a net positive. To facilitate the change, we introduce new functions rest_bind(), similar to (and in fact wrapping) std::bind_front(), that capture references like the lambdas did originally. We can't use std::bind_front directly since the call to seastar::httpd::path_description::set() cannot be disambiguated after the function is obscured by the template returned by std::bind_front. The new function rest_bind() has constraints to understand which overload is in use. Closes scylladb/scylladb#22526	2025-02-04 12:06:18 +02:00
Ran Regev	edd56a2c1c	moved cache files to db As requested in #22097, moved the files and fixed other includes and build system. Fixes: #22097 Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#22495	2025-02-04 12:21:31 +03:00
Pavel Emelyanov	e47c7d5255	Merge 'config: Improve internode_compression option validation and documentation' from Kefu Chai This PR enhances the internode_compression configuration option in two ways: 1. Add validation for option values Previously, we silently defaulted to 'none' when given invalid values. Now we explicitly validate against the three supported values (all, dc, none) and reject invalid inputs. This provides better error messages when users misconfigure the option. 2. Fix documentation rendering The help text for this option previously used C++ escape sequences which rendered incorrectly in Sphinx-generated HTML. We now use bullet points with '*' prefix to list the available values, matching our documentation style for other config options. This ensures consistent rendering in both CLI and HTML outputs. Note: The current documentation format puts type/default/liveness information in the same bullet list as option values. This affects other config options as well and will need to be addressed in a separate change. --- this improves the handling of invalid option values, and improves the doc rendering, neither of which is critical. hence no need to backport. Closes scylladb/scylladb#22548 * github.com:scylladb/scylladb: config: validate internode_compression option values config: start available options with '*'	2025-02-04 10:17:23 +03:00
Andrei Chekun	2a99494752	test.py: Remove workaround for python bug in asyncio Bug https://bugs.python.org/issue26789 is resolved in python 3.10. The frozen tool chain uses python 3.12. Since this is a supported and recommended way for work environment, removing workaround and bumping requirements for a newer python version. Closes scylladb/scylladb#22627	2025-02-03 22:27:34 +02:00
David Garcia	fe4750ffc3	docs: fetch multiversion config from remote sources fix: brand Closes scylladb/scylladb#22616	2025-02-03 15:25:10 +02:00
Yaron Kaikov	4f832c31b9	.github/workflows/make-pr-ready-for-review: add missing permissions Following the work done in `ed4bfad5c3`, the action is failing with the following error: ``` Error: Input required and not supplied: token ``` It is due to missing permissions in the workflow, adding it Closes scylladb/scylladb#22630	2025-02-03 13:25:27 +02:00
Gleb Natapov	fe45ea505b	topology_coordinator: demote barrier_and_drain rpc failure to warning The failure may happen during normal operation as well (for instance if leader changes). Fixes: scylladb/scylladb#22364	2025-02-03 13:09:58 +02:00
Gleb Natapov	1da7d6bf02	topology_coordinator: read peers table only once during topology state application During topology state application peers table may be updated with the new ip->id mapping. The update is not atomic: it adds new mapping and then removes the old one. If we call get_host_id_to_ip_map while this is happening it may trigger an internal error there. This is a regression since `ef929c5def`. Before that commit the code read the peers table only once before starting the update loop. This patch restores the behaviour. Fixes: scylladb/scylladb#22578	2025-02-03 13:09:18 +02:00
Aleksandra Martyniuk	43427b8fe0	test: add test to check tablet repair child	2025-02-03 10:31:16 +01:00
Aleksandra Martyniuk	c23ce40f50	service: add child for tablet repair virtual task tablet_repair_task_impl is run as a part of tablet repair. Make it a child of tablet repair virtual task. tablet_repair_task_impl started by /storage_service/repair_async API (vnode repair) does not have a parent, as it is the top-level task in that case.	2025-02-03 10:31:14 +01:00
Avi Kivity	d237d0a4ea	Update seastar submodule * seastar 71036ebcc0...5b95d1d798 (3): > rpc stream: do not abort stream queue if stream connection was closed without error > resource: fallback to sysconf when failed to detect memory size from hwloc > Merge 'scheduling_group: improve scheduling group creation exception safety' from Michael Litvak scylla-gdb.py adjusted for scheduling_group_specific data structure changes in Seastar. As part of that, a gratuitous dereference of std::unique_ptr, which fails for std::unique_ptr<void*, ...>, was removed.	2025-02-03 00:10:38 +02:00
Botond Dénes	e1b1a2068a	reader_concurrency_semaphore: foreach_permit(): include _inactive_reads So inactive reads show up in semaphore diagnostics dumps (currently the only non-test user of this method). Fixes: #22574 Closes scylladb/scylladb#22575	2025-01-30 22:46:57 +02:00
Michael Litvak	44c06ddfbb	test/test_view_build_status: fix wrong assert in test The test expects and asserts that after wait_for_view is completed we read the view_build_status table and get a row for each node and view. But this is wrong because wait_for_view may have read the table on one node, and then we query the table on a different node that didn't insert all the rows yet, so the assert could fail. To fix it we change the test to retry and check that eventually all expected rows are found and then eventually removed on the same host. Fixes scylladb/scylladb#22547 Closes scylladb/scylladb#22585	2025-01-30 21:25:53 +02:00
Michael Litvak	6d34125eb7	view_builder: fix loop in view builder when tokens are moved The view builder builds a view by going over the entire token ring, consuming the base table partitions, and generating view updates for each partition. A view is considered as built when we complete a full cycle of the token ring. Suppose we start to build a view at a token F. We will consume all partitions with tokens starting at F until the maximum token, then go back to the minimum token and consume all partitions until F, and then we detect that we pass F and complete building the view. This happens in the view builder consumer in `check_for_built_views`. The problem is that we check if we pass the first token F with the condition `_step.current_token() >= it->first_token` whenever we consume a new partition or the current_token goes back to the minimum token. But suppose that we don't have any partitions with a token greater than or equal to the first token (this could happen if the partition with token F was moved to another node for example), then this condition will never be satisfied, and we don't detect correctly when we pass F. Instead, we go back to the minimum token, building the same token ranges again, in a possibly infinite loop. To fix this we add another step when reaching the end of the reader's stream. When this happens it means we don't have any more fragments to consume until the end of the range, so we advance the current_token to the end of the range, simulating a partition, and check for built views in that range. Fixes scylladb/scylladb#21829 Closes scylladb/scylladb#22493	2025-01-30 14:35:18 +02:00
Nikos Dragazis	439862a8d4	test/cqlpy: Reproduce bug with exceeded limit on secondary index Add two cqlpy tests that reproduce a bug where a secondary index query returns more rows than the specified limit. This occurs when the indexed column is a partition key column or the first clustering key column, the query result spans multiple partitions, and the last partition causes the limit to be exceeded. `test/cqlpy/run --release ...` shows that the tests fail for Scylla versions all the way back to 4.4.0. Older Scylla versions fail with a syntax error in CQL query which suggests some incompatibility in the CQL protocol. That said, this bug is not a regression. The tests pass in Cassandra 5.0.2. Refs #22158. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#22513	2025-01-30 13:24:15 +02:00
Kefu Chai	f39cfd8eb0	compaction: switch boost::algorithm::any_of to std::ranges::any_of std::any_of was included by C++11, and boost::algorithm::any_of() is provided by Boost for users stuck in the pre-C++11 era. in our case, we've moved into C++23, where the ranges variant of this algorithm is available. in order to reduce the header dependency, let's switch to `std::ranges::any_of()`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22503	2025-01-30 13:22:33 +02:00
Artsiom Mishuta	03606b8e22	test.py:topology_random_failures: enable tests deselected for #21534 removed test deselections for issue scylladb/scylladb#21534 as it is closed now fixes: scylladb/scylladb#21711 Closes scylladb/scylladb#22424	2025-01-30 12:12:19 +01:00
Wojciech Mitros	677f9962cf	mv: forbid views with tablets by default Materialized views with tablets are not stable yet, but we want them available as an experimental feature, mainly for testing. The feature was added in scylladb/scylladb#21833, but currently it has no effect. All tests have been updated to use the feature, so we should finally make it work. This patch prevents users from creating materialized views in keyspaces using tablets when the VIEWS_WITH_TABLETS feature is not enabled - such requests will now get rejected. Fixes scylladb/scylladb#21832 Closes scylladb/scylladb#22217	2025-01-30 12:10:47 +01:00
aberry-21	69a0431cce	schema: add validation for PERCENTILE values in `speculative_retry` configuration This commit addresses issue #21825, where invalid PERCENTILE values for the `speculative_retry` setting were not properly handled, causing potential server crashes. The valid range for PERCENTILE is between 0 and 100, as defined in the documentation for speculative retry options, where values above 100 or below 0 are invalid and should be rejected. The added validation ensures that such invalid values are rejected with a clear error message, improving system stability and user experience. Fixes #21825 Closes scylladb/scylladb#21879	2025-01-30 11:34:46 +02:00
Yaron Kaikov	ed4bfad5c3	.github: add action to make PR ready for review when conflicts label was removed Moving a PR out of draft is only allowed to users with write access, adding a github action to switch PR to `ready for review` once the `conflicts` label was removed Closes scylladb/scylladb#22446	2025-01-30 11:33:25 +02:00
Nadav Har'El	698a63e14b	test/alternator: test for invalid B value in UpdateItem This patch adds an Alternator test for the case of UpdateItem attempting to insert an invalid B (bytes) value into an item. Values of type B use base64 encoding, and an attempt to insert a value which isn't valid base64 should be rejected, and this is what this test verifies. The new tests reproduce issue #17539, which claimed we have a bug in this area. However, test/alternator/run with the "--release" option shows that this bug existed in Scylla 5.2, but was fixed long ago, in 5.3, and doesn't exist in master. But we never had a regression test for this issue, so now we do. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22029	2025-01-30 11:33:03 +02:00
Botond Dénes	af46894bb7	Merge 'Rack aware view pairing' from Benny Halevy Enabled with the tablets_rack_aware_view_pairing cluster feature rack-aware pairing pairs base to view replicas that are in the same dc and rack, using their ordinality in the replica map We distinguish between 2 cases: - Simple rack-aware pairing: when the replication factor in the dc is a multiple of the number of racks and the minimum number of nodes per rack in the dc is greater than or equal to rf / nr_racks. In this case (that includes the single rack case), all racks would have the same number of replicas, so we first filter all replicas by dc and rack, retaining their ordinality in the process, and finally, we pair between the base replicas and view replicas, that are in the same rack, using their original order in the tablet-map replica set. For example, nr_racks=2, rf=4: base_replicas = { N00, N01, N10, N11 } view_replicas = { N11, N12, N01, N02 } pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 } Note that we don't optimize for self-pairing if it breaks pairing ordinality. - Complex rack-aware pairing: when the replication factor is not a multiple of nr_racks. In this case, we attempt best-match pairing in all racks, using the minimum number of base or view replicas in each rack (given their global ordinality), while pairing all the other replicas, across racks, sorted by their ordinality. 
For example, nr_racks=4, rf=3: base_replicas = { N00, N10, N20 } view_replicas = { N11, N21, N31 } pairing would be: { N00, N31 }*, { N10, N11 }, { N20, N21 } * cross-rack pair If we'd simply stable-sort both base and view replicas by rack, we might end up with much worse pairing across racks: { N00, N11 }*, { N10, N21 }*, { N20, N31 }* * cross-rack pair Fixes scylladb/scylladb#17147 * This is an improvement so no backport is required Closes scylladb/scylladb#21453 * github.com:scylladb/scylladb: network_topology_strategy_test: add tablets rack_aware_view_pairing tests view: get_view_natural_endpoint: implement rack-aware pairing for tablets view: get_view_natural_endpoint: handle case when there are too few view replicas view: get_view_natural_endpoint: track replica locator::nodes locator: topology: consult local_dc_rack if node not found by host_id locator: node: add dc and rack getters feature_service: add tablet_rack_aware_view_pairing feature view: get_view_natural_endpoint: refactor predicate function view: get_view_natural_endpoint: clarify documentation view: mutate_MV: optimize remote_endpoints filtering check view: mutate_MV: lookup base and view erms synchronously view: mutate_MV: calculate keyspace-dependent flags once	2025-01-30 11:32:19 +02:00
Aleksandra Martyniuk	328818a50f	replica: mark registry entry as synch after the table is added When a replica gets a write request it performs get_schema_for_write, which waits until the schema is synced. However, database::add_column_family marks a schema as synced before the table is added. Hence, the write may see the schema as synced, but hit no_such_column_family as the table hasn't been added yet. Mark schema as synced after the table is added to database::_tables_metadata. Fixes: #22347. Closes scylladb/scylladb#22348	2025-01-30 11:30:07 +02:00
Aleksandra Martyniuk	477ad98b72	nodetool: tasks: print empty string for start_time/end_time if unspecified If start_time/end_time is unspecified for a task, task_manager API returns epoch. Nodetool prints the value in task status. Fix nodetool tasks commands to print empty string for start_time/end_time if it isn't specified. Modify nodetool tasks status docs to show empty end_time. Fixes: #22373. Closes scylladb/scylladb#22370	2025-01-30 11:29:36 +02:00
Calle Wilund	7db14420b7	encryption: Fix encrypted components mask check in describe Fixes #22401 In the fix for scylladb/scylla-enterprise#892, the extraction and check for sstable component encryption mask was copied to a subroutine for description purposes, but a very important 1 << <value> shift was somehow left on the floor. Without this, the check for whether we actually contain a component encrypted can be wholly broken for some components. Closes scylladb/scylladb#22398	2025-01-30 11:29:13 +02:00
Botond Dénes	8e89f2e88e	Merge 'audit: make categories, tables, and keyspaces liveupdatable' from Andrzej Jackowski This change: - Remove code that prevented audit from starting if audit_categories, audit_tables, and audit_keyspaces are not configured - Set liveness::LiveUpdate for audit_categories, audit_tables, and audit_keyspaces - Keep const reference to db::config in audit, so current config values can be obtained by audit implementation - Implement function audit::update_config to parse given string, update audit datastructures when needed, and log the changes. - Add observers to call audit::update_config when categories, tables, or keyspaces configuration changes New functionality, so no backport needed. Fixes https://github.com/scylladb/scylla-enterprise/issues/1789 Closes scylladb/scylladb#22449 * github.com:scylladb/scylladb: audit: make categories, tables, and keyspaces liveupdatable audit: move static parsing functions above audit constructors audit: move statement_category to string conversion to static function audit: start audit even with empty categories/tables/keyspaces	2025-01-30 11:28:49 +02:00
Botond Dénes	d8b8a6c5fc	Merge 'api: task_manager: do not unregister finish task when its status is queried' from Aleksandra Martyniuk Currently, when the status of a task is queried and the task is already finished, it gets unregistered. Getting the status shouldn't be a one-time operation. Stop removing the task after its status is queried. Adjust tests not to rely on this behavior. Add task_manager/drain API and nodetool tasks drain command to remove finished tasks in the module. Fixes: https://github.com/scylladb/scylladb/issues/21388. It's a fix to task_manager API, should be backported to all branches Closes scylladb/scylladb#22310 * github.com:scylladb/scylladb: api: task_manager: do not unregister tasks on get_status api: task_manager: add /task_manager/drain	2025-01-30 11:27:44 +02:00
Botond Dénes	98fdf05b0e	Merge 'Fix repair vs storage services initialization order' from Pavel Emelyanov Repair service is started after storage service, while storage service needs to reference repair one for its needs. Recently it was noticed, that this reverse order may cause troubles and was fixed with the help of an extra gate. That's not nice and makes the start-stop mess even worse. The correct fix is to fix the order both services start/stop in. Closes scylladb/scylladb#22368 * github.com:scylladb/scylladb: Revert "repair: add repair_service gate" main: Start repair before storage service repair: Check for sharded<view-builder> when constructing row_level_repair	2025-01-30 11:26:24 +02:00
Nadav Har'El	98a8ae0552	test/alternator: functional tests for Alternator multi-item transactions This patch adds extensive functional tests for the DynamoDB multi-item transactions feature - the TransactWriteItems and TransactGetItems requests. We add 43 test functions, spanning more than 1000 lines of code, covering the different parameters and corner cases of these requests. Because we don't support the transaction feature in Alternator yet (this is issue #5064), all of these tests fail on Alternator but all of them were tested to pass on DynamoDB. So all new tests are marked "xfail". These tests will be handy for whoever will implement this feature as an acceptance test, and can also be useful for whoever will just want to understand this feature better - the tests are short and simple and heavily commented. Note that these tests only check the correct functionality of individual calls of these requests - these tests cannot and do not check the consistency or isolation guarantees of concurrent invocations of several requests. Such tests would require a different test framework, such as the one requested in issue #6350, and are therefore not part of this patch. Note that this patch includes ONLY tests, and does not mean that an implementation of the feature will soon follow. In fact, nobody is currently working on implementing this feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22239	2025-01-30 11:22:05 +02:00
Avi Kivity	000791ad5c	README: adjust to reflect license change Adjust the contact section to reflect the license change. Closes scylladb/scylladb#22537	2025-01-30 10:28:32 +03:00
Kamil Braun	febd45861e	test/lib: cql_test_env: make service shutdown more verbose Introduce `defer_verbose_shutdown` in `cql_test_env` which logs a message before and after shutting down a service, distinguishing between success and failure. The function is similar to the one in `main` but skips special error handling logic applicable only to the main Scylla binary. The purpose of the `cql_test_env` version of this function is only more verbose logging. If necessary it can be extended in the future with additional logic. I estimated the impact on the size of produced log files using `cdc_test` as an example: ``` $ build/dev/test/boost/combined_tests --run_test=cdc_test -- --smp=2 \ >logfile 2>&1 $ du -b logfile ``` the result before this commit: 1964064 bytes, after: 2196432 bytes, so estimated ~12% increase of log file size for boost tests that use `cql_test_env`, assuming that the number of logs printed by each test is similar to the logs printed by `cdc_test` (but I believe `cdc_test` is one of the less verbose tests so this is an overestimate). The motivation for this change is easier debugging of shutdown issues. When investigating scylladb/scylladb#21983, where an exception is thrown somewhere during the shutdown procedure, I found it hard to pinpoint the service from which the exception originates. This change will make it easier to debug issues like that by wrapping shutdown of each service in a pair of messages logged when shutdown starts and when it finishes (including when it fails). We should get more details on this issue when it reproduces again in CI after this commit is merged into `master`. (I failed to reproduce it locally with 1000 runs.) Ref scylladb/scylladb#21983 Closes scylladb/scylladb#22566	2025-01-30 10:27:45 +03:00
Botond Dénes	5dd6fcfe6f	Merge 'encrypted_file_impl: Check for reads on or past actual file length in transform' from Calle Wilund Fixes #22236 If reading a file and not stopping on block bounds returned by `size()`, we could allow reading from (_file_size+<1-15>) (if crossing block boundary) and try to decrypt this buffer (last one). Simplest example: Actual data size: 4095 Physical file size: 4095 + key block size (typically 16) Read from 4096: -> 15 bytes (padding) -> transform return `_file_size` - `read offset` -> wraparound -> rather larger number than we expected (not to mention the data in question is junk/zero). Check on last block in `transform` would wrap around size due to us being >= file size (l). Just do an early bounds check and return zero if we're past the actual data limit. Closes scylladb/scylladb#22395 * github.com:scylladb/scylladb: encrypted_file_test: Test reads beyond decrypted file length encrypted_file_impl: Check for reads on or past actual file length in transform	2025-01-29 20:09:32 +02:00
Anna Stuchlik	2a6445343c	doc: update the Web Installer docs to remove OSS Fixes https://github.com/scylladb/scylladb/issues/22292 Closes scylladb/scylladb#22433	2025-01-29 20:00:01 +02:00
Anna Stuchlik	caf598b118	doc: add SStable support in 2025.1 This commit adds the information about SStable version support in 2025.1 by replacing "2022.2" with "2022.2 and above". In addition, this commit removes information about versions that are no longer supported. Fixes https://github.com/scylladb/scylladb/issues/22485 Closes scylladb/scylladb#22486	2025-01-29 19:59:24 +02:00
dependabot[bot]	962bd452f5	build(deps): bump sphinx-scylladb-theme from 1.8.3 to 1.8.5 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.3 to 1.8.5. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.3...1.8.5) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#22435	2025-01-29 19:57:33 +02:00
Pavel Emelyanov	4aff86ac64	sstables: Mark sstable::sstable_buffer_size const It really never changes once set in constructor Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#22533	2025-01-29 19:51:22 +02:00
Avi Kivity	c71f383cf2	Merge 'sstables: use std::variant instead of boost::variant' from Botond Dénes Continue replacing boost types with std one where possible. Improvement, no backport needed Closes scylladb/scylladb#22290 * github.com:scylladb/scylladb: sstables: disk_types: disk_set_of_tagged_union: boost::variant -> std::variant scylla-gdb.py: std_variant: fix get() sstables: disk_types: remove unused disk_tagged_union	2025-01-29 15:29:15 +02:00
Michał Chojnowski	80072eefe5	test/scylla_gdb: add more checks to coro_task() test_coro_frame is flaky, as if `service::topology_coordinator::run() [clone .resume]` wasn't running on the shard. But it's supposed to. Perhaps this is a bug in `find_vptrs()`? This patch asks `scylla find` for a second opinion, and also prints all `find_vptrs()`, to see if it's the only coroutine missing from there. Closes scylladb/scylladb#22534	2025-01-29 11:02:24 +02:00
Kefu Chai	e218a62a7a	cdc,index: replace boost::ends_with() with .ends_with() since C++20, std::string and std::string_view started providing `ends_with()` member function, the same applies to `seastar::sstring`, so there is no need to use `boost::ends_with()` anymore. in this change, we switch from `boost::ends_with()` to the member functions variant to - improve the readability - reduce the header dependency Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22502	2025-01-29 11:52:55 +03:00
Pavel Emelyanov	2eb06c6384	Update seastar submodule with recent IO scheduler improvement * seastar 18221366...71036ebc (2): > Merge 'fair_queue: make the fair_group token grabbing discipline more fair' from Michał Chojnowski apps/io_tester: add some test cases for the IO scheduler test: in fair_queue_test, ensure that tokens are only replenished by test_env fair_queue: track the total capacity of queued requests fair_queue: make the fair_group token grabbing discipline more fair > scheduling: auto-detect scheduling group key rename() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#22504	2025-01-29 11:39:12 +03:00
Kefu Chai	8d0cabb392	config: validate internode_compression option values Previously, the internode_compression option silently defaulted to 'none' for any unrecognized value instead of validating input. It only compared against 'all' and 'dc', making it error-prone. Add explicit validation for the three supported values: - all - dc - none This ensures invalid values are rejected both in command line and YAML configuration, providing better error messages to users. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-29 14:52:35 +08:00
Kefu Chai	a81416862a	config: start available options with '*' use '*' prefix for config option values instead of escape sequences The custom Sphinx extension that generates documentation from config.cc help messages has issues with C++ escape sequences. For example, "\tall: All traffic" renders incorrectly as "tall: All traffic" in HTML output. Instead of using escape sequences, switch to bullet-point style with '*' prefix which works better in both CLI and HTML rendering. This matches our existing documentation style for available option values in other configs. Note: This change puts type/default/liveness info in the same bullet list as option values. This limitation affects other similar config options and will need to be addressed comprehensively in a future change. Refs scylladb/scylladb#22423 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-29 14:52:35 +08:00
Dawid Pawlik	a68bf6dcc1	docs: add vector type documentation Add missing vector type documentation including: definition of vector, adjustment of term definition, JSON encoding, Lua and cql3 type mapping, vector dimension limit, and keyword specification.	2025-01-28 21:14:49 +01:00
Jan Łakomy	947933366f	cassandra_tests: translate tests covering the vector type Add cql_vector_test which tests the basic functionalities of the vector type using CQL. Add vectors_test which tests if descending ordering of vector is supported.	2025-01-28 21:14:49 +01:00
Jan Łakomy	84c92837e0	type_codec: add vector type encoding This change has been introduced to enable CQL drivers to recognize vector type in query results. The encoding has been imported from Apache Cassandra implementation to match Cassandra's and latest drivers' behaviour. Co-authored-by: Dawid Pawlik <501149991dp@gmail.com>	2025-01-28 21:14:49 +01:00
Dawid Pawlik	489ab1345e	boost/expr_test: add vector expression tests Add and adjust tests using vector and list_or_vector style types. Implemented utilities used in expr_test similar to those added in `8f6309bd66`.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	ed49093a01	expression: adjust collection constructor list style As mentioned in the previous commit, this change introduces usage of the vector style type and adjusts the functions using the list style type to distinguish vectors from lists. Rename collection constructor style list to list_or_vector.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	69c754f0d4	expression: add vector style type The motivation for this change is to provide a distinguishable interface for vector type expressions. The square bracket literal is ambiguous for lists and vectors, so we need to make the distinction without relying on the CQL layer. At first we should use the collection constructor to manage both lists and vectors (although a vector is not a collection). Later, during preparation of expressions, we should be able to determine the exact type using the given receiver (column specification). Knowing the type of the expression, we may use its respective style type (in this case the vector style type being introduced), which would make the implementation more precise and allow us to evaluate the expressions properly. This commit introduces the vector style type and functions making use of it. However, the vector style type is not yet used anywhere; the next commit should adjust the collection constructor and make use of the new vector style type and its features.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	7554e55c2c	test/boost: add vector type cql_env boost tests These tests check serialization and deserialization (including JSON), basic inserts and selects, aggregate functions, element validation, vector usage in user defined types and functions. test_vector_between_user_types is a translated Apache Cassandra test to check if it is handled properly internally.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	c3a1760a44	test/boost: add vector type_parser tests Contains two type_parser tests: one for a valid vector and another for invalid vector.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	9954eb0ed7	type_parser: support vector type This change is introduced due to lack of support for vector class name, used by type_parser to create data_type based on given class name (especially compound class name with inner types or other parameters). Add function that parses vector type parameters from a class name.	2025-01-28 21:14:49 +01:00
Dawid Pawlik	aac10d261c	cql3: add vector type syntax Introduce vector_type CQL syntax: VECTOR<`cql_type`, `integer`>. The parameters are respectively a type of elements of the vector and the vector's dimension (number of elements). Co-authored-by: Jan Łakomy <janpiotrlakomy@gmail.com>	2025-01-28 21:14:49 +01:00
Michael Litvak	4f5550d7f2	cdc: fix handling of new generation during raft upgrade During raft upgrade, a node may gossip about a new CDC generation that was propagated through raft. The node that receives the generation by gossip may have not applied the raft update yet, and it will not find the generation in the system tables. We should consider this error non-fatal and retry to read until it succeeds or becomes obsolete. Another issue is when we fail with a "fatal" exception and not retrying to read, the cdc metadata is left in an inconsistent state that causes further attempts to insert this CDC generation to fail. What happens is we complete preparing the new generation by calling `prepare`, we insert an empty entry for the generation's timestamp, and then we fail. The next time we try to insert the generation, we skip inserting it because we see that it already has an entry in the metadata and we determine that there's nothing to do. But this is wrong, because the entry is empty, and we should continue to insert the generation. To fix it, we change `prepare` to return `true` when the entry already exists but it's empty, indicating we should continue to insert the generation. Fixes scylladb/scylladb#21227 Closes scylladb/scylladb#22093	2025-01-28 18:05:32 +01:00
Kamil Braun	add97ccc15	Merge 'Do not update topology on address change' from Gleb Natapov Since now topology does not contain ip addresses there is no need to create topology on an ip address change. Only peers table has to be updated. The series factors out peers table update code from sync_raft_topology_nodes() and calls it on topology and ip address updates. As a side effect it fixes #22293 since now topology loading does not require IP to be present, so the assert that is triggered in this bug is removed. Fixes: scylladb/scylladb#22293 Closes scylladb/scylladb#22519 * github.com:scylladb/scylladb: topology coordinator: do not update topology on address change topology coordinator: split out the peer table update functionality from raft state application	2025-01-28 12:52:29 +01:00
Avi Kivity	7f2d901c89	Merge 'repair: handle no_such_keyspace in repair preparation phase' from Aleksandra Martyniuk Currently, data sync repair handles most no_such_keyspace exceptions, but it omits the preparation phase, where the exception could be thrown during make_global_effective_replication_map. Skip the keyspace repair if no_such_keyspace is thrown during preparations. Fixes: #22073. Requires backport to 6.1 and 6.2 as they contain the bug Closes scylladb/scylladb#22473 * github.com:scylladb/scylladb: test: add test to check if repair handles no_such_keyspace repair: handle keyspace dropped	2025-01-28 13:42:38 +02:00
Pavel Emelyanov	4d8d7f1f1d	Merge 'backup_task: remove a component once it is uploaded ' from Kefu Chai Previously, during backup, SSTable components are preserved in the snapshot directory even after being uploaded. This leads to redundant uploads in case of failed backups or restarts, wasting time and resources (S3 API calls). This change removes SSTable components from the snapshot directory once they are successfully uploaded to the target location. This prevents re-uploading the same files and reduces disk usage. This change only "Refs" https://github.com/scylladb/scylladb/issues/20655, because, we can further optimize the backup process, consider: - Sending HEAD requests to S3 to check for existing files before uploading. - Implementing support for resuming partially uploaded files. Fixes https://github.com/scylladb/scylladb/issues/21799 Refs https://github.com/scylladb/scylladb/issues/20655 --- the backup API is not used in production yet, so no need to backport. Closes scylladb/scylladb#22285 * github.com:scylladb/scylladb: backup_task: remove a component once it is uploaded backup_task: extract component upload logic into dedicated function snapshot-ctl: change snapshot_ctl::run_snapshot_modify_operation() to regular func	2025-01-28 14:27:50 +03:00
Kefu Chai	5da2691f05	cmake: remove redundant BINARY_DIR setting for Seastar ExternalProject automatically creates BINARY_DIR for Seastar, but generator expressions are not supported in this setting. This caused CMake to create an unused "build/$<CONFIG>/seastar" directory. Instead, define a dedicated variable matching configure.py's naming and use it in supported options like BUILD_COMMAND. This: - Creates build files in the standard "Seastar-prefix/src/Seastar-build" directory instead of "build/$<CONFIG>/seastar". see https://cmake.org/cmake/help/latest/module/ExternalProject.html#directory-options - Makes it clearer that the variable should match configure.py settings No functional changes to the Seastar build process - purely a cleanup to reduce confusion when inspecting the build directory. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22437	2025-01-28 14:23:59 +03:00
Kefu Chai	57b14220ce	tree: remove unused "#include"s these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`, use `using seastar::future` to include `future` in the global namespace, this makes `readers/mutation_reader.hh` a header exposing `future<>`, but this is not a good practice, because, unlike `seastarx.hh` or `seastar/core/future.hh`, `reader/mutation_reader.hh` is not responsible for exposing seastar declarations. so, we trade the using statement for `#include "seastarx.hh"` in that file to decouple the source files including it from this header because of this statement. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22439	2025-01-28 14:12:06 +03:00
Kefu Chai	ce2d235c88	docs: correct typo of "abd" to "and" Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22442	2025-01-28 14:11:02 +03:00
Pavel Emelyanov	3b081b4839	Merge 'TLS: reduce inotify usage by sharing reloadability across shards' from Calle Wilund Refs https://github.com/scylladb/seastar/issues/2513 Reloadable certificates use inotify instances. On a loaded test (CI) server, we've seen cases where we literally run out of capacity. This patch uses the extended callback and reload capability of seastar TLS to only create actual reloadable certificate objects on shard 0 for our main TLS points (encryption only does TLS on shard 0 already). Closes scylladb/scylladb#22425 * github.com:scylladb/scylladb: alternator: Make server peering sharded and reuse reloadable certs messaging_service: Share reloadability of certificates across shards redis/controller: Reuse shard 0 reloadable certificates for all shards controller: Reuse shard 0 reloadable certificates for all shards generic_server: Allow sharing reloadability of certificates across shards	2025-01-28 14:08:17 +03:00
Tomasz Grabiec	50d9d5b98e	Merge 'truncate: trigger truncate logic from a transition state instead of global topology request' from Ferenc Szili Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in `topology_coordinator::handle_global_request()` while `topology::tstate` remains empty. This creates problems because `topology::is_busy()` uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing. This change introduces a new topology transition `topology::transition_state::truncate_table` and moves the truncate logic to a new method `topology_coordinator::handle_truncate_table()`. This method is now called as a handler of the `truncate_table` transition state instead of a handler of the `truncate_table` global topology request. This PR is a bugfix for truncate with tables and needs to be backported to 2025.1 Closes scylladb/scylladb#22452 * github.com:scylladb/scylladb: truncate: trigger truncate logic from transition state instead of global request handler truncate: add truncate_table transition state	2025-01-28 12:05:57 +01:00
Asias He	0682b1c716	repair: Skip hints and batchlog flush in case of nodes down The flush API could not detect that a node is down and fail the flush before the timeout. This patch detects whether there is a down node and skips the flush if so, since the flush will fail after the timeout in this case anyway. The slowness due to the flush timeout in compaction_test.py::TestCompaction::test_delete_tombstone_gc_node_down is fixed with this patch. Fixes #22413 Closes scylladb/scylladb#22445	2025-01-28 12:04:42 +01:00
Calle Wilund	4843711fbd	alternator: Make server peering sharded and reuse reloadable certs Reuse reloadability across shards by limiting reload to shard 0, and use call to other shards to reload other shards certs.	2025-01-27 16:16:24 +00:00
Calle Wilund	15d1664a5c	messaging_service: Share reloadability of certificates across shards Only create reloadable cert object on shard 0, and call other shards on reload callback to reload other shards "manually".	2025-01-27 16:16:24 +00:00
Calle Wilund	5f7c733b1e	redis/controller: Reuse shard 0 reloadable certificates for all shards Provide a getter to "listen" method and only use full reloadable object on shard 0.	2025-01-27 16:16:24 +00:00
Calle Wilund	aab35e6806	controller: Reuse shard 0 reloadable certificates for all shards Provide a getter to "listen" method and only use full reloadable object on shard 0.	2025-01-27 16:16:23 +00:00
Calle Wilund	c59c87c233	generic_server: Allow sharing reloadability of certificates across shards Adds an optional callback to "listen", returning the shard local object instance. If provided, instead of creating a "full" reloadable certificate object, only do so on shard 0, and use callback to reload other shards "manually".	2025-01-27 16:16:23 +00:00
Botond Dénes	b70dccb638	sstables: disk_types: disk_set_of_tagged_union: boost::variant -> std::variant In the spirit of using standard-library types, instead of boost ones where possible. Although a disk type, it is serialized/deserialized with custom code, so the change shouldn't cause any changes in the disk representation.	2025-01-27 09:29:26 -05:00
Botond Dénes	a095b3bb80	scylla-gdb.py: std_variant: fix get() It calls self.get_with_type() with one too many params.	2025-01-27 09:29:26 -05:00
Botond Dénes	08f1aecc1e	sstables: disk_types: remove unused disk_tagged_union	2025-01-27 09:29:26 -05:00
Anna Stuchlik	b2a718547f	doc: remove Enterprise labels and directives This PR removes the now redundant Enterprise labels and directives from the ScyllaDB documentation. Fixes https://github.com/scylladb/scylladb/issues/22432 Closes scylladb/scylladb#22434	2025-01-27 16:01:48 +02:00
Asias He	0ab64551c5	storage_service: Reject nodetool removenode force It is almost always a bad idea to run removenode force. This means a node is removed without the remaining nodes streaming the data that they should own after the removal. This leaves the cluster in a worse state than a node being down. One can use one of the following procedures instead: 1) Fix the dead node and move it back to the cluster 2) Run replace ops to replace the dead node 3) Run removenode ops again We have seen misuse of nodetool removenode force by users again and again. This patch rejects it so it can not be misused anymore. Fixes scylladb/scylladb#15833 Closes scylladb/scylladb#15834	2025-01-27 14:50:18 +01:00
Anna Stuchlik	1d5ef3dddb	doc: enable the FIPS note in the ScyllaDB docs This commit removes the information about FIPS out of the '.. only:: enterprise' directive. As a result, the information will now show in the doc in the ScyllaDB repo (previously, the directive included the note in the Enterprise docs only). Refs https://github.com/scylladb/scylla-enterprise/issues/5020 Closes scylladb/scylladb#22374	2025-01-27 15:48:54 +02:00
Calle Wilund	bae5b44b97	docs: Remove configuration_encryptor Fixes #21993 Removes configuration_encryptor mention from docs. The tool itself (java) is not included in the main branch java tools, thus need not remove from there. Only the words. Closes scylladb/scylladb#22427	2025-01-27 15:45:18 +02:00
Nikos Dragazis	2fb95e4e2f	encrypted_file_test: Test reads beyond decrypted file length Add a test to reproduce a bug in the read DMA API of `encrypted_file_impl` (the file implementation for Encryption-at-Rest). The test creates an encrypted file that contains padding, and then attempts to read from an offset within the padding area. Although this offset is invalid on the decrypted file, the `encrypted_file_impl` makes no checks and proceeds with the decryption of padding data, which eventually leads to bogus results. Refs #22236. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `8f936b2cbc`)	2025-01-27 13:19:37 +00:00
Calle Wilund	e96cc52668	encrypted_file_impl: Check for reads on or past actual file length in transform Fixes #22236 If reading a file and not stopping on block bounds returned by `size()`, we could allow reading from (_file_size+1-15) (block boundary) and try to decrypt this buffer (last one). Check on last block in `transform` would wrap around size due to us being >= file size (l). Simplest example: Actual data size: 4095 Physical file size: 4095 + key block size (typically 16) Read from 4096: -> 15 bytes (padding) -> transform return _file_size - read offset -> wraparound -> rather larger number than we expected (not to mention the data in question is junk/zero). Just do an early bounds check and return zero if we're past the actual data limit. v2: * Moved check to a min expression instead * Added lengthy comment * Added unit test v3: * Fixed read_dma_bulk handling of short, unaligned read * Added test for unaligned read v4: * Added another unaligned test case	2025-01-27 13:19:37 +00:00
Avi Kivity	6b85c03221	Merge 'split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Ferenc Szili `tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups. The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that `tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but `tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE Fixes #22431 This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1 Closes scylladb/scylladb#22330 * github.com:scylladb/scylladb: test: add reproducer and test for fix to split ready CG creation table: run set_split_mode() on all storage groups during all_storage_groups_split()	2025-01-27 13:13:42 +01:00
Anna Stuchlik	61c822715c	doc: add OS support for 2025.1 and reorganize the page This commit adds the OS support information for version 2025.1. In addition, the OS support page is reorganized so that: - The content is moved from the include page _common/os-support-info.rst to the regular os-support.rst page. The include page was necessary to document different support for OSS and Enterprise versions, so we don't need it anymore. - I skipped the entries for versions that won't be supported when 2025.1 is released: 6.1 and 2023.1. - I moved the definition of "supported" to the end of the page for better readability. - I've renamed the index entry to "OS Support" to be shorter on the left menu. Fixes https://github.com/scylladb/scylladb/issues/22474 Closes scylladb/scylladb#22476	2025-01-27 13:13:41 +01:00
Botond Dénes	9fc14f203b	Merge 'Simplify loading_cache_test and use manual_clock' from Benny Halevy This series exposes a Clock template parameter for loading_cache so that the test could use the manual_clock rather than the lowres_clock, since relying on the latter is flaky. In addition, the test load function is simplified to sleep some small random time and co_return the expected string, rather than reading it from a real file, since the latter's timing might also be flaky, and it is out of scope for this test. Fixes #20322 * The test was flaky forever, so backport is required for all live versions. Closes scylladb/scylladb#22064 * github.com:scylladb/scylladb: tests: loading_cache_test: use manual_clock utils: loading_cache: make clock_type a template parameter test: loading_cache_test: use function-scope loader test: loading_cache_test: simlute loader using sleep test: lib: eventually: add sleep function param test: lib: eventually: make *EVENTUALLY_EQUAL inline functions	2025-01-27 13:13:41 +01:00
Yaron Kaikov	f91128096d	Update ScyllaDB version to: 2025.2.0-dev	2025-01-27 13:13:41 +01:00
Piotr Smaron	2a77405093	cql: inline abstract_replication_strategy::validate_replication_strategy This function is only called from 1 place and only contains 2 lines of code, just keeping it increases the code bloat	2025-01-27 12:13:45 +01:00
Piotr Smaron	3848293a43	cql: clean redundant code validating replication strategy options Most of the code from `recognized_options` is either incorrect or lacks any implementation, for example: - comments for Everywhere and Local strategies are contradictory, first says to allow all options, second says that the strategy doesn't accept any options, even though both functions have the same implementation, - for Local & Everywhere strategies the same logic is repeated in `validate_options` member functions, i.e. this function does nothing, - for NetworkTopology this function returns DC names and tablet options, but tablet options are empty; OTOH this strategy also accepts 'replication_factor' tag, which was omitted, - for SimpleStrategy this function returns `replication_factor`, but this is also validated in `validate_options` function called just before the removed function. All of it makes `validate_replication_strategy` work incorrectly. That being said, 3 tests fail because of this logic's removal, so it did something after all. The failing tests are commented out, so that the CI passes, and will be restored in the next commit(s).	2025-01-27 12:01:59 +01:00
Andrzej Jackowski	5651cc49ed	audit: make categories, tables, and keyspaces liveupdatable This change: - Set liveness::LiveUpdate for audit_categories, audit_tables, and audit_keyspaces - Keep const reference to db::config in audit, so current config values can be obtained by audit implementation - Implement function audit::update_config to parse given string, update audit datastructures when needed, and log the changes. - Add observers to call audit::update_config when categories, tables, or keyspaces configuration changes Fixes scylladb/scylla-enterprise#1789	2025-01-27 11:37:13 +01:00
Andrzej Jackowski	5d4eb5d2dc	audit: move static parsing functions above audit constructors This change: - Swap static function and audit constructors in audit.cc This is a preparatory commit for enabling liveupdate of audit categories, tables, and keyspaces. It allows future use of static parsing functions in audit constructor.	2025-01-27 11:35:35 +01:00
Andrzej Jackowski	609d7b2725	audit: move statement_category to string conversion to static function This change: - Move audit_info::category_string to a new static function - Start using the new function in audit_info::category_string This is a preparatory commit for enabling liveupdate of audit categories, tables, and keyspaces. The newly created static function will be required for proper logging of audit categories.	2025-01-27 11:35:35 +01:00
Andrzej Jackowski	99b4a79df0	audit: start audit even with empty categories/tables/keyspaces This change: - Remove code that prevented audit from starting if audit_categories, audit_tables, and audit_keyspaces are not configured This is a preparatory commit for enabling liveupdate of audit categories, tables, and keyspaces. Without this change, audit is not started for particular categories/tables/keyspaces setting and it is unwanted behavior if customer can change audit configuration via liveupdate. This commit has performance implications if audit sink is set (meaning "audit"="table" or "audit"="syslog" in the config) but categories, tables, and keyspaces are not set to audit anything. Before this commit, audit was not started, so some operations (like creating audit_info or lookup in empty collections) were omitted.	2025-01-27 11:35:35 +01:00
Aleksandra Martyniuk	18cc79176a	api: task_manager: do not unregister tasks on get_status Currently, /task_manager/task_status_recursive/{task_id} and /task_manager/task_status/{task_id} unregister the queried task if it has already finished. The status should not disappear after being queried. Do not unregister finished task when its status or recursive status is queried.	2025-01-27 11:23:45 +01:00
Aleksandra Martyniuk	e37d1bcb98	api: task_manager: add /task_manager/drain In the following patches, get_status won't be unregistering finished tasks. However, tests need a functionality to drop a task, so that they could manipulate only with the tasks for operations that were invoked by these tests. Add /task_manager/drain/{module} to unregister all finished tasks from the module. Add respective nodetool command.	2025-01-27 11:23:45 +01:00
Aleksandra Martyniuk	54e7f2819c	test: add test to check if repair handles no_such_keyspace	2025-01-27 09:49:50 +01:00
Aleksandra Martyniuk	bfb1704afa	repair: handle keyspace dropped Currently, data sync repair handles most no_such_keyspace exceptions, but it omits the preparation phase, where the exception could be thrown during make_global_effective_replication_map. Skip the keyspace repair if no_such_keyspace is thrown during preparations.	2025-01-27 09:37:47 +01:00
Avi Kivity	a23a3110b5	utils: config_file: forward_declare boost::program_options classes Avoid pulling in boost dependencies when all we need is the class name. Closes scylladb/scylladb#22453	2025-01-27 10:45:43 +03:00
Takuya ASADA	fb4c7dc3d8	dist: Support FIPS mode - To make Scylla able to run in FIPS-compliant system, add .hmac files for crypto libraries on relocatable/rpm/deb packages. - Currently we just write hmac value on *.hmac files, but there is new .hmac file format something like this: ``` [global] format-version = 1 [lib.xxx.so.yy] path = /lib64/libxxx.so.yy hmac = <hmac> ``` Seems like GnuTLS rejects fips selftest on .libgnutls.so.30.hmac when file format is older one. Since we need to absolute path on "path" directive, we need to generate .libgnutls.so.30.hmac in older format on create-relocatable-script.py, Signed-off-by: Takuya ASADA <syuu@scylladb.com> Closes scylladb/scylladb#22384	2025-01-26 22:49:21 +02:00
Kefu Chai	0237913337	sstables: Migrate from boost::adaptors::indexed to std::views::enumerate This change modernizes the codebase by: - Replacing Boost's indexed adaptor with C++20's std::views::enumerate - Removing unnecessary Boost header inclusion With this change, we can: - Reduce external dependencies - Leverage standard library features - Improve long-term code maintainability Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22469	2025-01-26 20:51:14 +02:00
Jan Łakomy	9561ae5fc8	types: implement vector_type_impl The vector is a fixed-length array of non-null specified type elements. Implement serialization, deserialization, comparison, JSON and Lua support, and other functionalities. Co-authored-by: Dawid Pawlik <501149991dp@gmail.com>	2025-01-26 19:36:41 +01:00
Gleb Natapov	fbfef6b28a	topology coordinator: do not update topology on address change Since now topology does not contain ip addresses there is no need to create topology on an ip address change. Only peers table has to be updated, so call a function that does peers table update only.	2025-01-26 17:49:05 +02:00
Gleb Natapov	ef929c5def	topology coordinator: split out the peer table update functionality from raft state application Raft topology state application does two things: re-creates token metadata and updates peers table if needed. The code for both tasks is intermixed now. The patch separates it into separate functions. Will be needed in the next patch.	2025-01-26 17:47:38 +02:00
Avi Kivity	60cdf62fae	Merge 'Remove sharded<system_distributed_keyspace>& argument from storage_service::join_cluster()' from Pavel Emelyanov There's such a reference on storage_service itself, it can use this->_sys_dist_ks instead thus making its API (both internal and external) a bit simpler. Closes scylladb/scylladb#22483 * github.com:scylladb/scylladb: storage_service: Drop sys_dist_ks argument from track_upgrade_progress_to_topology_coordinator() storage_service: Drop sys_dist_ks argument from raft_state_monitor_fiber() storage_service: Drop sys_dist_ks argument from join_topology() storage_service: Drop sys_dist_ks argument from join_cluster()	2025-01-26 15:56:37 +02:00
Kefu Chai	769162de91	tree: correct misspellings these misspellings were identified by codespell. let's fix them. one of them is a part of a user visible string. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22443	2025-01-26 15:54:06 +02:00
Kefu Chai	d1c222d9bd	config: specialize config_from_string() for sstring Specialize config_from_string() for sstring to resolve lexical_cast stream state parsing limitation. This enables correct handling of empty string configurations, such as setting an empty value in CQL: ```cql UPDATE system.config SET value='' WHERE name='allowed_repair_based_node_ops'; ``` Previous implementation using boost::lexical_cast would fail due to EOF stream state, incorrectly rejecting valid empty string conversions. Fixes scylladb/scylladb#22491 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22492	2025-01-26 15:53:12 +02:00
Kefu Chai	4a268362b9	compress: fix compressor initialization order by making namespace_prefix a function Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be initialized before compressor::namespace_prefix due to undefined global variable initialization order across translation units. This was causing ZstdCompressor to be unregistered in release builds, making it impossible to create tables with Zstd compression. Replace the global namespace_prefix variable with a function that returns the fully qualified compressor name. This ensures proper initialization order and fixes the registration of the ZstdCompressor. Fixes scylladb/scylladb#22444 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22451	2025-01-26 13:43:02 +02:00
Kefu Chai	1151062b2a	Update seastar submodule * seastar a9bef537...18221366 (33): > io_queue: fix static member access to comply with CWG2813 > build: add missing include in program_options.cc > coroutine: move operator co_await(exception) into seastar::coroutine namespace > fair_queue: Mark entry constructor explicit > test: Add perf test to measure the "cost" of chain wakeup > websocket: Support clients that do not specify subprotocol > websocket: Accept plain const& to string as subprotocol > perf_tests: Inline print_text_header() into stdout_printer > perf_tests: Right-align numeric metrics in markdown tables > scripts/addr2line.py: fix hanging with the new llvm-addr2line version > Revert "rpc stream: do not abort stream queue if stream connection was closed without error" > websocket: Convert connection::read_http_upgrade_request() to use coros > rpc stream: do not abort stream queue if stream connection was closed without error > linux-aio: remove cpu reduction suggestions > gitignore: ignore directories that match "build*" > perf_tests: make column generic > net: replace deprecated ip::address_v4::from_string() > file: remove deprecated file lifetime hint APIs > semaphore: expiry_handler: tunnel exception_ptr to entry > tests: unit: refactor expected_exception > semaphore: return early exception before appending wait_list > semaphore: expiry_handler: refactor exception getters > abortable_fifo: support OnAbort callbacks accepting exception_ptr > abort_on_expiry: fix typos in comments > abort_on_expiry: request_abort with timed_out_error > Add missing include in dpdk_rte.hh > build: use path to libraries in .pc > httpd: drop unnecessary dependencies from httpd.hh > build: allow CMake to find Boost using package config > print: remove deprecated print() functions > github: s/ubuntu-latest/ubuntu-24.04/ > perf_tests: coroutinize main loop > add perf_tests_perf Closes scylladb/scylladb#22466	2025-01-26 12:54:14 +02:00
Asias He	4018dc7f0d	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migration data to node2 with the file stream. 
Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. 
- The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). 
The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. 
Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22467	2025-01-26 12:51:59 +02:00
Pavel Emelyanov	856832911d	storage_service: Drop sys_dist_ks argument from track_upgrade_progress_to_topology_coordinator() It's unused argument. The only caller is relaxed too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:29:40 +03:00
Pavel Emelyanov	1e93f51977	storage_service: Drop sys_dist_ks argument from raft_state_monitor_fiber() And the final drop of that kind -- switch to using this->_sys_dist_ks here too Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:29:03 +03:00
Pavel Emelyanov	248456cb9a	storage_service: Drop sys_dist_ks argument from join_topology() Similarly to previous patch, there's this->_sys_dist_ks thing Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:28:27 +03:00
Pavel Emelyanov	ca9b59f3b2	storage_service: Drop sys_dist_ks argument from join_cluster() Storage service has _sys_dist_ks onboard and can just use it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:26:32 +03:00
Avi Kivity	f4b1ad43d4	gdb: protect debug::the_database from lto Clang 18.1 with lto gained the ability to eliminate dead stores. Since debug::the_database is write-only as far as the compiler understands (it is read only by gdb), all writes to it are eliminated. Protect writes to the variable by marking it volatile. Closes scylladb/scylladb#22454	2025-01-23 22:26:04 +02:00
Pavel Emelyanov	eee3681d86	Merge 'tree: restore header compilation (${mode}-headers)' from Botond Dénes Our CI accidentally switched to using CMake to compile scylla and it looks like CMake doesn't run the `${mode}-headers` command correctly and some missing-include in headers managed to slip in. Compile fix, no backport needed. Closes scylladb/scylladb#22471 * github.com:scylladb/scylladb: test/raft/replication.hh: add missing include <fmt/std.h> test/boost/bptree_validation.hh: add missing include <fmt/format.h>	2025-01-23 15:35:55 +03:00
Botond Dénes	e038473887	test/raft/replication.hh: add missing include <fmt/std.h>	2025-01-23 07:29:01 -05:00
Botond Dénes	e60e575cb0	test/boost/bptree_validation.hh: add missing include <fmt/format.h>	2025-01-23 06:05:57 -05:00
Benny Halevy	32b7cab917	tests: loading_cache_test: use manual_clock Relying on a real-time clock like lowres_clock can be flaky (in particular in debug mode). Use manual_clock instead to harden the test against timing issues. Fixes #20322 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:08 +02:00
Benny Halevy	0841483d68	utils: loading_cache: make clock_type a template parameter So the unit test can use manual_clock rather than lowres_clock which can be flaky (in particular in debug mode). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:08 +02:00
Benny Halevy	b258f8cc69	test: loading_cache_test: use function-scope loader Rather than a global function, accessing a thread-local `load_count`. The thread-local load_count cannot be used when multiple test cases run in parallel. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:07 +02:00
Benny Halevy	d68829243f	test: loading_cache_test: simlute loader using sleep This test isn't about reading values from file, but rather it's about the loading_cache. Reading from the file can sometimes take longer than the expected refresh times, causing flakiness (see #20322). Rather than reading a string from a real file, just sleep a random, short time, and co_return the string. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:07 +02:00
Benny Halevy	934a9d3fd6	test: lib: eventually: add sleep function param To allow support for manual_clock instead of seastar::sleep. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-23 09:28:05 +02:00
Pavel Emelyanov	4edd327c4f	Revert "repair: add repair_service gate" This reverts commit `32ab58cdea`. Now repair service starts before and stops after storage server, so the problem described in the commit is no longer relevant.	2025-01-22 19:25:56 +03:00
Pavel Emelyanov	fff5b8adbc	main: Start repair before storage service The latter service uses repair, but not the vice-versa, so the correct (de)initialization order should be the same. refs: #2737 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-22 19:21:40 +03:00
Pavel Emelyanov	c5aa185e1b	repair: Check for sharded<view-builder> when constructing row_level_repair Currently initialization order of repair and view-builder is not correct, so there are several places in repair code that check for v.b. to be initialized before doing anything. There's one more place that needs that care -- the construction of row_level_repair object. The class instantiates helper objects that rely on view_builder to be fully initialized and is itself created by many other task types from repair code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-22 19:21:40 +03:00
Avi Kivity	0092bb5831	Merge 'main: rename `cql_sg_stats` metrics on scheduling group rename' from Piotr Dulikowski This PR contains the missing part of a fix for scylladb/scylla-enterprise#4912 which was omitted during migration of workload prioritization to the source available repository. Even though the regression test for it was ported, it was silently made ineffective by a different fix (scylladb/scylla-enterprise#4764), so this PR also improves the test. Fixes: scylladb/scylladb#22404 No need to backport - service levels are not yet a part of any source-available release. Closes scylladb/scylladb#22416 * github.com:scylladb/scylladb: test/auth_cluster: make test_service_level_metric_name_change useful main: rename `cql_sg_stats` metrics on scheduling group rename	2025-01-22 14:22:09 +02:00
Benny Halevy	b509644972	test: lib: eventually: make *EVENTUALLY_EQUAL inline functions rather than macros. This is a first cleanup step before adding a sleep function parameter to support also manual_clock. Also, add a call to BOOST_REQUIRE_EQUAL/BOOST_CHECK_EQUAL, respectively, to make an error more visible in the test log since those entry points print the offending values when not equal. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 12:47:33 +02:00
Ferenc Szili	9fa254e9a8	truncate: trigger truncate logic from transition state instead of global request handler Before this change, the logic of truncate for tablets was triggered from topology_coordinator::handle_global_request(). This was done without using a topology transition state which remained empty throughout the truncate handler's execution. This change moves the truncate logic to a new method topology_coordinator::handle_truncate_table(). This method is now called as a handler of the truncate_table topology transition state instead of a handler of the truncate_table global topology request.	2025-01-22 11:08:26 +01:00
Ferenc Szili	29ead7014e	truncate: add truncate_table transition state Truncate table for tablets is implemented as a global topology operation. However, it does not have a transition state associated with it, and performs the truncate logic in handle_global_request() while topology::tstate remains empty. This creates problems because topology::is_busy() uses transition_state to determine if the topology state machine is busy, and will return false even though a truncate operation is ongoing. This change adds a new transition state: truncate_table	2025-01-22 10:44:36 +01:00
Avi Kivity	59d3a66d18	Revert "Introduce file stream for tablet" This reverts commit `8208688178`. It was contributed from enterprise, but is too different from the original for me to merge back.	2025-01-22 09:42:20 +02:00
Benny Halevy	23284f038f	table: flush: synchronize with stop() When the table is stopped, all compaction groups are stopped, and as part of that, they are flushing their memtables. To synchronize with stop-induced flush operation, move _pending_flushes_phaser.stop() later in table::stop(), after all compaction groups are flushed and stopped. This way, in table::flush, if we see that the phaser is already closed, we know that there is nothing to flush, otherwise we start a flush operation that would be waited on by a parallel table::stop(). Fixes #22243 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22339	2025-01-22 09:23:09 +02:00
Benny Halevy	dd21d591f6	network_topology_strategy_test: add tablets rack_aware_view_pairing tests Test the simple case of base/view pairing with replication_factor that is a multiple of the number of racks. As well as the complex case when simple_tablets_rack_aware_view_pairing is not possible. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	249b793674	view: get_view_natural_endpoint: implement rack-aware pairing for tablets Enabled with the tablets_rack_aware_view_pairing cluster feature, rack-aware pairing pairs base to view replicas that are in the same dc and rack, using their ordinality in the replica map. We distinguish between 2 cases: - Simple rack-aware pairing: when the replication factor in the dc is a multiple of the number of racks and the minimum number of nodes per rack in the dc is greater than or equal to rf / nr_racks. In this case (that includes the single rack case), all racks would have the same number of replicas, so we first filter all replicas by dc and rack, retaining their ordinality in the process, and finally, we pair between the base replicas and view replicas, that are in the same rack, using their original order in the tablet-map replica set. For example, nr_racks=2, rf=4: base_replicas = { N00, N01, N10, N11 } view_replicas = { N11, N12, N01, N02 } pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 } Note that we don't optimize for self-pairing if it breaks pairing ordinality. - Complex rack-aware pairing: when the replication factor is not a multiple of nr_racks. In this case, we attempt best-match pairing in all racks, using the minimum number of base or view replicas in each rack (given their global ordinality), while pairing all the other replicas, across racks, sorted by their ordinality. For example, nr_racks=4, rf=3: base_replicas = { N00, N10, N20 } view_replicas = { N11, N21, N31 } pairing would be: { N00, N31 }*, { N10, N11 }, { N20, N21 } (* cross-rack pair) If we'd simply stable-sort both base and view replicas by rack, we might end up with much worse pairing across racks: { N00, N11 }*, { N10, N21 }*, { N20, N31 }* (* cross-rack pair) Fixes scylladb/scylladb#17147 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	0e388a1594	view: get_view_natural_endpoint: handle case when there are too few view replicas Currently, when reducing RF, we may drop replicas from the view before dropping replicas from the base table. Since get_view_natural_endpoint is allowed to return a disengaged optional if it can't find a pair for the base replica, replace the existing assertion with code handling this case, and count those events in a new table metric: total_view_updates_failed_pairing. Note that this does not fix the root cause for the issue which is the unsynchronized dropping of replicas, that should be atomic, using a single group0 transaction. Refs scylladb/scylladb#21492 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	858b0a51f8	view: get_view_natural_endpoint: track replica locator::nodes Rather than tracking only the replica host_id, keep track of the locator::node& to prepare for rack-aware pairing. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
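The simple rack-aware case described in the commit above can be sketched as follows. This is only an illustration under stated assumptions: the `replica` struct and the `simple_rack_aware_pair` function are hypothetical names, not ScyllaDB code, the datacenter dimension is ignored, and node names follow the commit's `N<rack><index>` convention.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical replica descriptor (assumption); the commit itself
// works with locator::node rather than a struct like this.
struct replica {
    std::string name;
    int rack;
};

// Pairs base[base_idx] with the view replica that has the same ordinal
// among the replicas of the same rack. Returns nullopt when the view
// has too few replicas in that rack.
inline std::optional<std::string> simple_rack_aware_pair(
        const std::vector<replica>& base,
        const std::vector<replica>& view,
        std::size_t base_idx) {
    const int rack = base[base_idx].rack;

    // Ordinal of the base replica among base replicas in its rack,
    // preserving the original replica-set order.
    std::size_t ordinal = 0;
    for (std::size_t i = 0; i < base_idx; ++i) {
        if (base[i].rack == rack) {
            ++ordinal;
        }
    }

    // Walk the view replicas of that rack until the same ordinal is reached.
    for (const auto& v : view) {
        if (v.rack != rack) {
            continue;
        }
        if (ordinal == 0) {
            return v.name;
        }
        --ordinal;
    }
    return std::nullopt;
}
```

With the commit's nr_racks=2, rf=4 example as input, this sketch yields the same four pairs listed in the message.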
Benny Halevy	2589115337	locator: topology: consult local_dc_rack if node not found by host_id Like get_location by inet_address, there is a case, when a node is replaced that the node cannot be found by host_id. Currently get_location would return a reference based on the nullptr which might cause a segfault as seen in testing. Instead, if the host_id is of the location, revert to calling get_location() which consults this_node or _cfg.local_dc_rack. Otherwise, throw a runtime_error. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	2bfebc1f62	locator: node: add dc and rack getters To simplify searching and sorting using std::ranges projection using std::mem_fn. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	6f8f03f593	feature_service: add tablet_rack_aware_view_pairing feature Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	cadd33bdf6	view: get_view_natural_endpoint: refactor predicate function Simplify the function logic by calculating the predicate function once, before scanning all base and view replicas, rather than testing the different options in the inner loop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	97f85e52f7	view: get_view_natural_endpoint: clarify documentation "self-pairing" is enabled only when use_legacy_self_pairing is enabled. That is currently unclear in the documentation comment for this function. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	6d4de30a3a	view: mutate_MV: optimize remote_endpoints filtering check Currently we always lookup both `my_address` and target_endpoint in remote_endpoints. But if my_address is in remote_endpoints in some cases the second lookup is not needed, so do it only to decide whether to swap target_endpoint with my_address, if found in remote_endpoints, or to remove that match, if target_endpoint is already pending as well. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	91d3bf8ebc	view: mutate_MV: lookup base and view erms synchronously Although at the moment storage_service::replicate_to_all_cores may yield between updating the base and view tables with a new effective_replication_map, scylladb/scylladb#21781 was submitted to change that so that they are updated atomically together. This change prepares for the above change, and is harmless at the moment. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Benny Halevy	d04cdce0fc	view: mutate_MV: calculate keyspace-dependent flags once All views live in the same keyspace as their base table, so calculate the keyspace-dependent flags once, outside the per-view update loop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-22 09:04:24 +02:00
Kefu Chai	8080658df7	backup_task: remove a component once it is uploaded Previously, during backup, SSTable components were preserved in the snapshot directory even after being uploaded. This leads to redundant uploads in case of failed backups or restarts, wasting time and resources (S3 API calls). This change - adds an optional query parameter named "move_files" to the "/storage_service/backup" API. If it is set to "true", SSTable components are removed once they are backed up to object storage. - conditionally removes SSTable components from the snapshot directory once they are successfully uploaded to the target location. This prevents re-uploading the same files and reduces disk usage. This change only "Refs" #20655, because we can further optimize the backup process. Consider: - Sending HEAD requests to S3 to check for existing files before uploading. - Implementing support for resuming partially uploaded files. Fixes #21799 Refs #20655 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-22 11:17:01 +08:00
Kefu Chai	32d22371b9	backup_task: extract component upload logic into dedicated function Extract upload_component() from backup_task_impl::do_backup() to improve readability and prepare for optional post-upload cleanup. This refactoring simplifies the main backup flow by isolating the upload logic into its own function. The change is motivated by an upcoming feature that will allow optional deletion of components after successful upload, which would otherwise add complexity to do_backup(). Refs scylladb/scylladb#21799 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-22 11:17:01 +08:00
Kefu Chai	ded31d1917	snapshot-ctl: change snapshot_ctl::run_snapshot_modify_operation() to regular func instead of implementing `snapshot_ctl::run_snapshot_modify_operation()` as a template function, let it accept a plain noncopyable_function instance, and return `future<>`. Previously, `snapshot_ctl::run_snapshot_modify_operation` was a template function that accepted a templated functor parameter. This approach limited its usability because callers needed to be defined in the same translation unit as the template implementation. however, `backup_task_impl` is defined in another translation unit, and we intend to call `snapshot_ctl::run_snapshot_modify_operation()` in its implementation. so in order to cater to this need, there are two options: 1. to move the definition of the template function into the header file. but the downside is that this slows down the compilation by increasing the size of header. 2. to change the template function to a regular function. This change restricts the function's parameter to a specific signature. However, all current callers already return a `future<>` object, so there's minimal impact. in this change, we implement the second option. this allows us to call this function from another translation unit. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-22 11:17:01 +08:00
Ferenc Szili	8bff7786a8	test: add reproducer and test for fix to split ready CG creation This adds a reproducer for #22431 In cases where a tablet storage group manager had more than one storage group, it was possible to create compaction groups outside the group0 guard, which could create problems with operations which should exclude with compaction group creation.	2025-01-21 18:43:10 +01:00
Ferenc Szili	24e8d2a55c	table: run set_split_mode() on all storage groups during all_storage_groups_split() tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode() for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using std::ranges::all_of() which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (set_split_mode()) returning false. set_split_mode() creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups. The missing split compaction groups are later created in tablet_storage_group_manager::split_all_storage_groups() which also calls set_split_mode(), and that is the reason why split completes successfully. The problem is that tablet_storage_group_manager::all_storage_groups_split() runs under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups() does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE	2025-01-21 18:42:53 +01:00
Nadav Har'El	a8805c4fc1	Merge 'cql3, test, utils: switch from boost::adaptors::uniqued to utils::views::unique ' from Kefu Chai In order to reduce the dependency on external libraries, and for better integration with ranges in C++ standard library. let's use the homebrew `utils::views::unique()` before unique is accepted by the C++ standard. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#22393 * github.com:scylladb/scylladb: cql3, test: switch from boost::adaptors::uniqued to utils::views::unique utils: implement drop-in replacement for replacing boost::adaptors::uniqued	2025-01-21 19:06:21 +02:00
Sergey Zolotukhin	38caabe3ef	test: Fix inconsistent naming of the log files. The log file names created in `scylla_cluster.py` by `ScyllaClusterManager` and files to be collected in conftest.py by `manager` should be in sync. This patch fixes the issue, originally introduced in scylladb/scylladb#22192 Fixes scylladb/scylladb#22387 Backports: 6.1 and 6.2. Closes scylladb/scylladb#22415	2025-01-21 10:45:17 +02:00
Kefu Chai	ccb7b4e606	cql3, test: switch from boost::adaptors::uniqued to utils::views::unique In order to reduce the dependency on external libraries, and for better integration with ranges in C++ standard library. let's use the homebrew `utils::views::unique()` before unique is accepted by the C++ standard. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-21 16:24:45 +08:00
Kefu Chai	d5d251da9a	utils: implement drop-in replacement for replacing boost::adaptors::uniqued Add a custom implementation of boost::adaptors::uniqued that is compatible with C++20 ranges library. This bridges the gap between Boost.Range and the C++ standard library ranges until std::views::unique becomes available in C++26. Currently, the unique view is included in [P2760](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2760r0.html) "A Plan for C++26 Ranges", which targets C++26. The implementation provides: - A lazy view adaptor that presents unique consecutive elements - No modification of source range - Compatibility with C++20 range views and concepts - Lighter header dependencies compared to Boost This resolves compilation errors when piping C++20 range views to boost::adaptors::uniqued, which fails due to concept requirements mismatch. For example: ```c++ auto range = std::views::take(n) | boost::adaptors::uniqued; // fails ``` This change also offers us a lightweight solution in terms of smaller header dependency. While std::ranges::unique has existed since C++20, it's an eager algorithm that modifies the source range in-place, unlike boost::adaptors::uniqued which is a lazy view. The proposed std::views::unique (P2760) targeting C++26 would provide this functionality, but is not yet available. This implementation serves as an interim solution for filtering consecutive duplicate elements using range views until std::views::unique is standardized. For more details on the differences between `std::ranges::unique` and `boost::adaptors::uniqued`: - boost::adaptors::uniqued is a view adaptor that creates a lazy view over the original range. 
It: * Doesn't modify the source range * Returns a view that presents unique consecutive elements * Is non-destructive and lazy-evaluated * Can be composed with other views - std::ranges::unique is an algorithm that: * Modifies the source range in-place * Removes consecutive duplicates by shifting elements * Returns an iterator to the new logical end * Cannot be used as a view or composed with other range adaptors Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-21 16:24:45 +08:00
Tomasz Grabiec	8059090a29	Merge 'Cache base info for view schemas in the schema registry' from Wojciech Mitros Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated due to all `schema_ptr`s temporarily dying. To fix this, this patch adds the base schema to the registry, alongside the view schema. We store just the frozen base schema, so that we can transfer it across shards. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. In this series we also make sure that the view schemas in the registry are kept up-to-date in regards to base schema changes. Fixes https://github.com/scylladb/scylladb/issues/21354 This issue is a bug, so adding backport labels 6.1 and 6.2 Closes scylladb/scylladb#21862 * github.com:scylladb/scylladb: test: add test for schema registry maintaining base info for views schema_registry: avoid setting base info when getting the schema from registry schema_registry: update cached base schemas when updating a view schema_registry: cache base schemas for views db: set base info before adding schema to registry	2025-01-21 00:17:54 +01:00
Nadav Har'El	3e16b80014	Merge 'Reject create table with compact storage' from Benny Halevy As discussed in https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813, compact storage tables are deprecated. Yet, there is nothing in the code that prevents users from creating such tables. This patch adds a live-updateable config option: `enable_create_table_with_compact_storage`, set to `false` by default, that requires users to opt-in in order to create new tables WITH COMPACT STORAGE. Refs scylladb/scylladb#12263, scylladb/scylladb#16375 * Since this guardrail is an enhancement, no backport is needed Closes scylladb/scylladb#16403 * github.com:scylladb/scylladb: docs: ddl: document the deprecation of compact tables test: enable_create_table_with_compact_storage for tests that need it config: add enable_create_table_with_compact_storage	2025-01-20 22:02:02 +02:00
Piotr Dulikowski	780ff17ff5	test/auth_cluster: make test_service_level_metric_name_change useful The test test_service_level_metric_name_change was originally introduced to serve as a regression test for scylladb/scylla-enterprise#4912. Before the fix, some per-scheduling-group metrics would not get adjusted when the scheduling group gets renamed (which does happen for SL-managed scheduling groups) and it would be possible to attempt to register metrics with the same set of labels, resulting in an error. However, in scylladb/scylla-enterprise#4764, another bug was fixed which affected the test. Before a service level is created, a "test" scheduling group can be created by service level controller if it is unsure whether it is allowed to create more scheduling groups or not. If creation of the scheduling group succeeds, it is put into the pool of scheduling groups to be reused when a new service level is created. Therefore, the node handling CREATE SERVICE LEVEL would always use the scheduling group that was originally created for the sake of the test as a SG for the new service level. All of the above is intentional and was actually fixed by the aforementioned issue. However, the test scheduling groups would always get unique names and, therefore, the error would no longer reproduce. However, the faulty logic that ran previously and caused the bug still runs - when a node updates its service levels cache on group0 reload. The test previously used only one node. Fix it by starting two nodes instead of one at the beginning of the test and by serving all service level commands to the first node - were the issue not fixed, the error would get triggered on the second node.	2025-01-20 18:17:15 +01:00
Piotr Dulikowski	de153a2ba7	main: rename `cql_sg_stats` metrics on scheduling group rename This commit contains the part of a fix for scylladb/scylla-enterprise#4912 that was accidentally omitted when workload prioritization was ported from enterprise to scylladb.git repo. Without it, the metrics created by `cql_sg_stats` would not be updated, leading to wrong scheduling group names being used in metrics' names, and could lead to "double metric registration errors" in some unlucky circumstances where a scheduling group would be created, destroyed and then created again. Fixes: scylladb/scylladb#22404	2025-01-20 18:16:46 +01:00
Tomasz Grabiec	c7f78edc78	Merge 'repair: Wire repair_time in system.tablets for tombstone gc' from Asias He The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e., when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507 New feature. No backport is needed. Closes scylladb/scylladb#21896 * github.com:scylladb/scylladb: repair: Stop using rpc to update repair time for repairs scheduled by scheduler repair: Wire repair_time in system.tablets for tombstone gc test: Disable flush_cache_time for two tablet repair tests test: Introduce guarantee_repair_time_next_second helper repair: Return repair time for repair_service::repair_tablet service: Add tablet_operation.hh	2025-01-20 18:08:49 +01:00
Benny Halevy	88ae067ddb	everywhere: add skeletal support for the in_memory_tables feature Forward-ported from scylla-enterprise. Note that the feature has been deprecated and the implementation is provided only for backward compatibility with pre-existing features and schema. Tested manually after adding the following to feature_service: ``` gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv }; ``` Launched a single-node cluster running 2023.1.10 ``` cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; cqlsh> create TABLE ks.test ( pk int PRIMARY KEY, val int ) WITH compaction = {'class': 'InMemoryCompactionStrategy'}; ``` log: ``` Scylla version 2023.1.10-0.20241227.21cffccc1ccd with build-id bd65b8399cb13b713a87e57fe333cfcabfd50be7 starting ... ... INFO 2024-12-27 19:45:16,563 [shard 0] migration_manager - Create new ColumnFamily: org.apache.cassandra.config.CFMetaData@0x600000f1b400[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName=ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,readRepairChance=0,dcLocalReadRepairChance=0,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,keyValidator=org.apache.cassandra.db.marshal.Int32Type,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class 
org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,in_memory=false,version=5529c631-c47a-11ef-bd1d-4295734ce5a8,droppedColumns={},collections={},indices={}] INFO 2024-12-27 19:45:16,564 [shard 0] schema_tables - Creating ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 ``` Upgraded to this branch and started scylla. Verified that ks.test was successfully loaded: log: ``` INFO 2024-12-27 19:48:58,115 [shard 0:main] init - Scylla version 6.3.0~dev-0.20241227.a64c6dfc153e with build-id f9496134a09cf2e55d3865b9e9ff499f672aa7da starting ... ... WARN 2024-12-27 19:53:02,948 [shard 1:main] CompactionStrategy - InMemoryCompactionStrategy is no longer supported. Defaulting to NullCompactionStrategy. ... 
INFO 2024-12-27 19:53:02,948 [shard 0:main] database - Keyspace ks: Reading CF test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ec88d510-6aff-344a-914d-541d37081440 storage=/home/bhalevy/scylladb/data/ks/test-5529c630c47a11efbd1d4295734ce5a8 ``` Then, tested: ``` cqlsh> describe KEYSPACE ks; CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.test ( pk int, val int, PRIMARY KEY (pk) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'InMemoryCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE'; cqlsh> alter TABLE ks.test with compaction = {'class': 'SizeTieredCompactionStrategy'}; cqlsh> describe KEYSPACE ks; CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.test ( pk int, val int, PRIMARY KEY (pk) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'SizeTieredCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE' AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'}; ``` log: ``` INFO 2024-12-27 19:56:40,465 [shard 0:stmt] 
migration_manager - Update table 'ks.test' From org.apache.cassandra.config.CFMetaData@0x60000362d800[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.InMemoryCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ec88d510-6aff-344a-914d-541d37081440,droppedColumns={},collections={},indices={}] To org.apache.cassandra.config.CFMetaData@0x60000336e000[cfId=5529c630-c47a-11ef-bd1d-4295734ce5a8,ksName==ks,cfName=test,cfType=Standard,comparator=org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type),comment=,tombstoneGcOptions={"mode":"timeout","propagation_delay_in_seconds":"3600"},gcGraceSeconds=864000,minCompactionThreshold=4,maxCompactionThreshold=32,columnMetadata=[ColumnDefinition{name=pk, type=org.apache.cassandra.db.marshal.Int32Type, kind=PARTITION_KEY, componentIndex=0, droppedAt=-9223372036854775808}, ColumnDefinition{name=val, type=org.apache.cassandra.db.marshal.Int32Type, kind=REGULAR, componentIndex=null, 
droppedAt=-9223372036854775808}],compactionStrategyClass=class org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy,compactionStrategyOptions={enabled=true},compressionParameters={sstable_compression=org.apache.cassandra.io.compress.LZ4Compressor},bloomFilterFpChance=0.01,memtableFlushPeriod=0,caching={"keys":"ALL","rows_per_partition":"ALL"},cdc={},defaultTimeToLive=0,minIndexInterval=128,maxIndexInterval=2048,speculativeRetry=99.0PERCENTILE,triggers=[],isDense=false,version=ecccf010-c47b-11ef-b52c-622f2f0e87c4,droppedColumns={},collections={},indices={}] INFO 2024-12-27 19:56:40,466 [shard 0: gms] schema_tables - Altering ks.test id=5529c630-c47a-11ef-bd1d-4295734ce5a8 version=ecccf010-c47b-11ef-b52c-622f2f0e87c4 ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22068	2025-01-20 16:55:17 +02:00
Kefu Chai	9d6ec45730	build: support wasm32-wasip1 target in configure.py Update configure.py to use wasm32-wasip1 as an alternative to wasm32-wasi, matching the behavior previously implemented for CMake builds in `8d7786cb0e`. This ensures consistent WASI target handling across both build systems. Refs #20878 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22386	2025-01-20 16:43:22 +02:00
Asias He	8208688178	Introduce file stream for tablet File based stream is a new feature that optimizes tablet movement significantly. It streams the entire SSTable files without deserializing SSTable files into mutation fragments and re-serializing them back into SSTables on receiving nodes. As a result, less data is streamed over the network, and less CPU is consumed, especially for data models that contain small cells. The following patches are imported from the scylla enterprise: ) Merge 'Introduce file stream for tablet' from Asias He This patch uses Seastar RPC stream interface to stream sstable files on network for tablet migration. It streams sstables instead of mutation fragments. The file based stream has multiple advantages over the mutation streaming. - No serialization or deserialization for mutation fragments - No need to read and process each mutation fragments - On wire data is more compact and smaller In the test below, a significant speed up is observed. Two nodes, 1 shard per node, 1 initial_tablets: - Start node 1 - Insert 10M rows of data with c-s - Bootstrap node 2 Node 1 will migrate data to node 2 with the file stream. 
Test results: 1) File stream: bytes on wire = 1132006250 bytes, bw = 836MB/s [shard 0:stre] stream_blob - stream_sstables[eadaa8e0-a4f2-4cc6-bf10-39ad1ce106b0] Finished sending sstable_nr=2 files_nr=18 files={} range=(-1,9223372036854775807] bytes_sent=1132006250 stream_bw=836MB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 1.08004s seconds 2) Mutation stream: bytes on wire = 3030004736 bytes, bw = 125410.87 KiB/s = 128MB/s [shard 0:stre] stream_session - [Stream #406dc8b0-56b5-11ee-bc2d-000bf4871058] Streaming plan for Tablet migration-ks1-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=2958989 KiB, 125410.87 KiB/s [shard 0:stre] storage_service - Streaming for tablet migration of a4f68900-568a-11ee-b7b9-c2b13945eed2:1 took 23.5992s seconds Test Summary: File stream v.s. Mutation stream improvements - Stream bandwidth = 836 / 128 (MB/s) = 6.53X - Stream time = 23.60 / 1.08 (Seconds) = 21.85X - Stream bytes on wire = 3030004736 / 1132006250 (Bytes)= 2.67X Closes scylladb/scylla-enterprise#3438 github.com:scylladb/scylla-enterprise: tests: Add file_stream_test streaming: Implement file stream for tablet ) streaming: Use new take_storage_snapshot interface The new take_storage_snapshot returns a file object instead of a file name. This allows the file stream sender to read from the file even if the file is deleted by compaction. Closes scylladb/scylla-enterprise#3728 ) streaming: Protect unsupported file types for file stream Currently, we assume the file streamed over the stream_blob rpc verb is a sstable file. This patch rejects the unsupported file types on the receiver side. This allows us to stream more file types later using the current file stream infrastructure without worrying about old nodes processing the new file types in the wrong way. 
- The file_ops::noop is renamed to file_ops::stream_sstables to be explicit about the file types - A missing test_file_stream_error_injection is added to the idl Fixes: #3846 Tests: test_unsupported_file_ops Closes scylladb/scylla-enterprise#3847 ) idl: Add service::session_id id to idl It will be used in the next patch. Refs #3907 ) streaming: Protect file stream with topology_guard Similar to "storage_service, tablets: Use session to guard tablet streaming", this patch protects file stream with topology_guard. Fixes #3907 ) streaming: Take service topology_guard under the try block Taking the service::topology_guard could throw. Currently, it throws outside the try block, so the rpc sink will not be closed, causing the following assertion: ``` scylla: seastar/include/seastar/rpc/rpc_impl.hh:815: virtual seastar::rpc::sink_impl<netw::serializer, streaming::stream_blob_cmd_data>::~sink_impl() [Serializer = netw::serializer, Out = <streaming::stream_blob_cmd_data>]: Assertion `this->_con->get()->sink_closed()' failed. ``` To fix, move more code including the topology_guard taking code to the try block. Fixes https://github.com/scylladb/scylla-enterprise/issues/4106 Closes scylladb/scylla-enterprise#4110 ) Merge 'Preserve original SSTable state with file based tablet migration' from Raphael "Raph" Carvalho We're not preserving the SSTable state across file based migration, so staging SSTables for example are being placed into main directory, and consequently, we're mixing staging and non-staging data, losing the ability to continue from where the old replica left off. It's expected that the view update backlog is transferred from old into new replica, as migration doesn't wait for leaving replica to complete view update work (which can take long). Elasticity is preferred. So this fix guarantees that the state of the SSTable will be preserved by propagating it in form of subdirectory (each subdirectory is statically mapped with a particular state). 
The staging sstables aren't being registered into view update generator yet, as that's supposed to be fixed in OSS (more details can be found at https://github.com/scylladb/scylladb/issues/19149). Fixes #4265. Closes scylladb/scylla-enterprise#4267 * github.com:scylladb/scylla-enterprise: tablet: Preserve original SSTable state with file based tablet migration sstables: Add get method for sstable state ) sstable: (Re-)add shareabled_components getter ) Merge 'File streaming sstables: Use sstable source/sink to transfer snapshots' from Calle Wilund Fixes #4246 Alternative approach/better separation of concern, transport vs. sstable layer. Builds on #4472, but fancier. Ensures we transfer and pre-process scylla metadata for streamed file blobs first, then properly apply receiving nodes local config by using a source and sink layer exported from sstables, which handles things like ordering, metadata filtering (on source) as well as handling metadata and proper IO paths when writing data on receiver node (sink). This implementation maintains the statelessness of the current design, and the delegated sink side will re-read and re-write the metadata for each component processed. This is a little wasteful, but the meta is small, and it is less error prone than trying to do caching cross-shards etc. The transport is isolated from the knowledge. This is an alternative/complement to #4436 and #4472, fixing the underlying issue. Note that while the layers/API:s here allows easy fixing of other fundamental problems in the feature (such as destination location etc), these are not included in the PR, to keep it as close to the current behaviour as possible. 
Closes scylladb/scylla-enterprise#4646 * github.com:scylladb/scylla-enterprise: raft_tests: Copy/add a topology test with encryption file streaming: Use sstable source/sink to transfer snapshots sstables: Add source and sink objects + producers for transfering a snapshot sstable::types: Add remove accessor for extension info in metadata ) The change for error injection in merge commit 966ea5955dd8760: File streaming now has "stream_mutation_fragments" error injection points so test_table_dropped_during_streaming works with file streaming. ) doc: document file-based streaming This commit adds a description of the file-based streaming feature to the documentation. It will be displayed in the docs using the scylladb_include_flag directive after https://github.com/scylladb/scylladb/pull/20182 is merged, backported to branch-6.0, and, in turn, branch-2024.2. Refs https://github.com/scylladb/scylla-enterprise/issues/4585 Refs https://github.com/scylladb/scylla-enterprise/issues/4254 Closes scylladb/scylla-enterprise#4587 ) doc: move File-based streaming to the Tablets source file-based-streaming This commit moves the description of file-based streaming from a common include file to the regular doc source file where tablets are described. Closes scylladb/scylla-enterprise#4652 ) streaming: sstable_stream_sink_impl: abort: prevent null pointer dereference Closes scylladb/scylladb#22034	2025-01-20 16:43:21 +02:00
Yaniv Michael Kaul	7495237a33	Remove noexcept_traits.hh header file The content of the header file noexcept_traits.hh is unused throughout ScyllaDB's code base. As part of a greater effort to cleanup Scylla's code and reduce content in the root directory, this header file is simply removed. This is code cleanup - no need to backport. Fixes: https://github.com/scylladb/scylladb/issues/22117 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#22139	2025-01-20 16:43:21 +02:00
Nadav Har'El	8caea23d2a	test/cqlpy/run: fix regression in "--release" option The way that the "test/cqlpy/run --release" feature runs older Scylla releases is that it takes today's command line parameters and "fixes" it to conform to what old releases took. This approach was easy to implement (and the resulting "--release" feature is super useful), but the downside is that we need to update this fixup code whenever we add new options to the Scylla command line used by test/cqlpy/run.py. Commit `d04f376` made test/cqlpy/run.py use a new option "--experimental-features=views-with-tablets", so now we need to remove it when running older versions of Scylla. So this is what we do in this patch. Fixes #22349 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22350	2025-01-20 16:43:21 +02:00
Nadav Har'El	12cbdfa095	test/cqlpy: add regression test for tombstone_gc in "desc table" The small cqlpy test in this patch is a regression test for issue #14390, which claimed that the Scylla-only "tombstone_gc" option is missing from the output of "describe table". This test shows that this report is not true, at least not when the "server-side describe" is used. "test/cqlpy/run --release ..." shows that this test passes on master and also for Scylla versions all the way back to Scylla 5.2 (Scylla 5.1 did not support server-side describe, so the test fails for that reason). This suggests that the report in issue #14390 was for old-style client-side (cqlsh) describe, which we no longer support, so this issue can be closed. Fixes #14390. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22354	2025-01-20 16:43:21 +02:00
Avi Kivity	d2869ecb2b	partition_range_compat: drop dependency on boost ranges Unused anyway. Closes scylladb/scylladb#22359	2025-01-20 16:43:21 +02:00
Anna Stuchlik	e340d6a452	doc: remove Open Source references in the docs Fixes https://github.com/scylladb/scylladb/issues/22325 Closes scylladb/scylladb#22377	2025-01-20 16:43:21 +02:00
Botond Dénes	1f20f7810e	Merge 'main, encryption: correct misspellings' from Kefu Chai in this changeset, some misspellings identified by codespell were corrected. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#22301 * github.com:scylladb/scylladb: ent/encryption: rename "sie" to "get_opt" ent,main: fix misspellings	2025-01-20 16:43:21 +02:00
Benny Halevy	5c77956205	docs: ddl: document the deprecation of compact tables Add a paragraph documenting the decision to deprecate the COMPACT STORAGE feature, and instruct the user how to enable the feature despite that. Note that we don't have an official migration strategy for users like `DROP COMPACT STORAGE`, which is not implemented at this time (See #3882). Fixes #16375 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-20 08:14:39 +02:00
Benny Halevy	f3ab00e61c	test: enable_create_table_with_compact_storage for tests that need it Now enable_create_table_with_compact_storage can be set to `false` by default in db/config. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-20 08:14:37 +02:00
Benny Halevy	0110eb0506	config: add enable_create_table_with_compact_storage As discussed in https://github.com/scylladb/scylladb/issues/12263#issuecomment-1853576813, compact storage tables are deprecated. Yet, there is nothing in the code that prevents users from creating such tables. This patch adds a live-updateable config option: `enable_create_table_with_compact_storage` that requires users to opt in in order to create new tables WITH COMPACT STORAGE. The option is currently set to `true` by default in db/config to reduce the churn to tests and to `false` in scylla.yaml, for new clusters. TODO: once regression tests that use compact storage are converted to enable the option, change the default in db/config to false. A unit test was added to test/cql-pytest that checks that the respective cql query fails as expected with the default option or when it is explicitly set to `false`, and that the query succeeds when the option is set to `true`. Note that `check_restricted_table_properties` already returns an optional warning, but it is only logged, not returned in the `prepared_statement`. Fixing that is out of the scope of this patch. See https://github.com/scylladb/scylladb/issues/20945 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-01-20 08:03:25 +02:00
Kefu Chai	1ef2d9d076	tree: migrate from boost::adaptors::transformed to std::views::transform Replace remaining uses of boost::adaptors::transformed with std::views::transform to reduce Boost dependencies, following the migration pattern established in `bab12e3a`. This change addresses recently merged code that reintroduced Boost header dependencies through boost::adaptors::transformed usage. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22365	2025-01-17 16:56:40 +02:00
Botond Dénes	47989b1503	Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk In this change, tablet_virtual_task starts supporting tablet resize (i.e. split and merge). Users can see running resize tasks - finished tasks are not presented with the task manager API. A new task state "suspended" is added. If a resize was revoked, it will appear to users as suspended. We assume that the resize was revoked when the tablet number didn't change. Fixes: #21366. Fixes: #21367. No backport, new feature Closes scylladb/scylladb#21891 * github.com:scylladb/scylladb: test: boost: check resize_task_info in tablet_test.cc test: add tests to check revoked resize virtual tasks test: add tests to check the list of resize virtual tasks test: add tests to check split and merge virtual tasks status test: test_tablet_tasks: generalize functions replica: service: add split virtual task's children replica: service: pass parent info down to storage_group::split tasks: children of virtual tasks aren't internal by default tasks: initialize shard in task_info ctor service: extend tablet_virtual_task::abort service: return status_helper struct from tablet_virtual_task::get_status_helper service: extend tablet_virtual_task::wait tasks: add suspended task state service: extend tablet_virtual_task::get_status service: extend tablet_virtual_task::contains service: extend tablet_virtual_task::get_stats service: add service::task_manager_module::get_nodes tasks: add task_manager::get_nodes tasks: drop noexcept from module::get_nodes replica: service: add resize_task_info static column to system.tablets locator: extend tablet_task_info to cover resize tasks	2025-01-17 14:24:07 +02:00
Piotr Dulikowski	6aa962f5f4	Merge 'Add audit subsystem for database operations' from Paweł Zakrzewski Introduces a comprehensive audit system to track database operations for security and compliance purposes. This change includes: Core Components: - New audit subsystem for logging database operations - Service level integration for proper resource management - CQL statement tracking with operation categories - Login process integration for tenant management Key Features: - Configurable audit logging (syslog/table) - Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN) - Selective auditing by keyspace/table - Password sanitization in audit logs - Service level shares support (1-1000) for workload prioritization - Proper lifecycle management and cleanup I ran the dtests for audit (manually enabled) and they pass. The in-repo tests pass. Notably, there should be no non-whitespace changes between this and scylla-enterprise Fixes scylladb/scylla-enterprise#4999 Closes scylladb/scylladb#22147 * github.com:scylladb/scylladb: audit: Add shares support to service level management audit: Add service level support to CQL login process audit: Add support to CQL statements audit: Integrate audit subsystem into Scylla main process audit: Add documentation for the audit subsystem audit: Add the audit subsystem	2025-01-17 13:14:55 +01:00
Kamil Braun	89ee2a6834	Merge 'drop ip addresses from token metadata' from Gleb Now that all topology related code uses host ids there is no point to maintain ip to id (and back) mappings in the token metadata. After the patch the mapping will be maintained in the gossiper only. The rest of the system will use host ids and in rare cases where translation is needed (mostly for UX compatibility reasons) the translation will be done using gossiper. Fixes: scylladb/scylla#21777 * 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits) hint manager: do not translate ip to id in case hint manager is stopped already locator: token_metadata: drop update_host_id() function that does nothing now locator: topology: drop indexing by ips repair: drop unneeded code storage_service: use host_id to look for a node in on_alive handler storage_proxy: translate ips to ids in forward array using gossiper locator: topology: remove unused functions storage_service: check for outdated ip in on_change notification in the peers table storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology topology coordinator: change connection dropping code to work on host ids cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query locator: drop unused function from tablet_effective_replication_map api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead locator: token_metadata: remove unused ip based functions locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs gossiper: drop get_unreachable_token_owners functions storage_service: use gossiper to map ip to id in node_ops operations storage_service: fix indentation after the last patch storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node token_metadata: drop no longer used functions ...	2025-01-17 11:00:52 +01:00
Kefu Chai	4a5a00347f	utils: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22201	2025-01-17 11:24:54 +03:00
Botond Dénes	55963f8f79	replica: remove noexcept from token -> tablet resolution path The methods to resolve a key/token/range to a table are all noexcept. Yet the method below all of these, `storage_group_for_id()` can throw. This means that if due to any mistake a tablet without local replica is attempted to be looked up, it will result in a crash, as the exception bubbles up into the noexcept methods. There is no value in pretending that looking up the tablet replica is noexcept, remove the noexcept specifiers so that any bad lookup only fails the operation at hand and doesn't crash the node. This is especially relevant to replace, which still has a window where writes can arrive for tablets that don't (yet) have a local replica. Currently, this results in a crash. After this patch, this will only fail the writes and the replace can move on. Fixes: #21480 Closes scylladb/scylladb#22251	2025-01-17 11:24:09 +03:00
Łukasz Paszkowski	adef719c43	api/storage_service: Remove unimplemented truncate API The API /storage_service/truncate/{ks} returns an unimplemented error when invoked. As we already have a CQL command, `TRUNCATE TABLE ks.cf` that causes the table to be truncated on all nodes, the API can be dropped. Due to the error, it is unused. Fixes https://github.com/scylladb/scylladb/issues/10520 No backport is required. A small cleanup of not working API. Closes scylladb/scylladb#22258	2025-01-17 11:21:05 +03:00
Pavel Emelyanov	14c3fbbf8c	Merge 'sstable_directory: do not load remote unshared sstables in process_descriptor()' from Lakshmi Narayanan Sreethar The sstable loader relied on the generation id to provide an efficient hint about the shard that owns an sstable. But, this hint was rendered ineffective with the introduction of UUID generation, as the shard id was no longer embedded in the generation id. This also became suboptimal with the introduction of tablets. Commit `0c77f77` addressed this issue by reading the minimum from disk to determine sstable ownership but this improvement was lost with commit `63f1969`, which optimistically assumed that hints would work most of the time, which isn't true. This commit restores that change - shard id of a table is deduced by reading minimally from disk and then the sstable is fully loaded only if it belongs to the local shard. This patch also adds a testcase to verify that the sstables are loaded only in their respective shards. Fixes #21015 This fixes a regression and should be backported. Closes scylladb/scylladb#22263 * github.com:scylladb/scylladb: sstable_directory: do not load remote sstables in process_descriptor sstable_directory: update `load_sstable()` definition sstable_directory: reintroduce `get_shards_for_this_sstable()`	2025-01-17 11:17:54 +03:00
Asias He	387b2050df	repair: Stop using rpc to update repair time for repairs scheduled by scheduler If a tablet repair is scheduled by tablet repair scheduler, the repair time for tombstone gc will be updated when the system.tablets.repair_time is updated. Skip updating using rpc calls in this case.	2025-01-17 16:12:05 +08:00
Asias He	53e6025aa6	repair: Wire repair_time in system.tablets for tombstone gc The repair_time in system.tablets will be updated when repair runs successfully. We can now use it to update the repair time for tombstone gc, i.e, when the system.tablets.repair_time is propagated, call gc_state.update_repair_time() on the node that is the owner of the tablet. Since `b3b3e880d3` ("repair: Reduce hints and batchlog flush"), the repair time that could be used for tombstone gc might be smaller than when the repair is started, so the actual repair time for tombstone gc is returned by the repair rpc call from the repair master node. Fixes #17507	2025-01-17 16:12:05 +08:00
Asias He	0b2fef74bc	test: Disable flush_cache_time for two tablet repair tests The cache of the hints and batchlog flush makes the exact repair time check difficult in the test. Disabling it for two repair tests that check the exact repair time.	2025-01-17 16:12:05 +08:00
Asias He	23afbd938c	test: Introduce guarantee_repair_time_next_second helper The repair time granularity is seconds. This helper makes sure the repair time is different than the previous one.	2025-01-17 16:12:05 +08:00
Asias He	41a1eca072	repair: Return repair time for repair_service::repair_tablet The repair time returned by repair_service::repair_tablet considers the hints and batchlog flush time, so it could be used for the tombstone gc purpose.	2025-01-17 16:12:05 +08:00
Asias He	614c3380c6	service: Add tablet_operation.hh A tablet_operation_result struct is added to track the result of a tablet operation.	2025-01-17 16:12:05 +08:00
Avi Kivity	d6f7f873d0	utils: config_file: don't use extern fully specialized variable templates Declaring-but-not-defining a fully specialized template is a great way to cut dependencies between users and providers, but unfortunately not supported for variable templates. Clang 18 does support it, but apparently it is a misinterpretation of the standard, and was removed in clang 19. We started using this non-feature in `7ed89266b3`. The fix is to use function templates. This is more verbose as each specialization needs to define a static variable to return, but is fully supported. Closes scylladb/scylladb#22299	2025-01-17 11:06:50 +03:00
Botond Dénes	2428f22d3e	Update tools/python3 submodule * tools/python3 fbf12d02...8415caf4 (1): > dist: Support FIPS mode	2025-01-17 09:17:29 +02:00
Tzach Livyatan	a00ab65491	remove BETA from metric and API reference Closes scylladb/scylladb#22092	2025-01-16 19:25:51 -05:00
Łukasz Paszkowski	aad46bd6f3	reader_concurrency_semaphore: do_wait_admission(): remove dumping diagnostics The commit `b39ca29b3c` introduced detection of admission-waiter anomaly and dumps permit diagnostics as soon as the semaphore did not admit readers even though it could. Later on, the commit `bf3d0b3543` introduces the optimization where the admission check is moved to the fiber processing the _read_list. Since the semaphore no longer admits readers as soon as it can, dumping diagnostic errors is not necessary as the situation is not abnormal. Closes scylladb/scylladb#22344	2025-01-16 19:23:43 -05:00
Nadav Har'El	955ac1b7b7	test/alternator: close boto3 client before shutting down For several years now, we have seen a strange, and very rare, flakiness in Alternator tests described in issue #17564: We see all the tests pass, pytest declares them to have passed, and while Python is exiting, it crashes with a signal 11 (SIGSEGV). Because this happens exclusively in test/alternator and never in the test/cqlpy, we suspect that something that the test/alternator leaves behind but test/cqlpy does not, causes some race and crashes during shutdown. The immediate suspect is the boto3 library, or rather, the urllib3 library which it uses. This is more-or-less the only thing that test/alternator does which test/cqlpy doesn't. The urllib3 library keeps around pools of reusable connections, and it's possible (although I don't actually have any proof for it) that these open connections may cause a crash during shutdown. So in this patch I add to the "dynamodb" and "dynamodbstreams" fixtures (which all Alternator tests use to connect to the server), a teardown which calls close() for the boto3 client object. This close() call percolates down to calling clear() on urllib3's PoolManager. Hopefully, this will make some difference in the chance to crash during shutdown - and if it doesn't, it won't hurt. Refs #17564 Closes scylladb/scylladb#22341	2025-01-16 19:21:00 -05:00
Gleb Natapov	a40e810442	hint manager: do not translate ip to id in case hint manager is stopped already Since we do not stop storage proxy on shutdown this code can be called during shutdown when address map is no longer usable.	2025-01-16 16:37:08 +02:00
Gleb Natapov	1e4b2f25dc	locator: token_metadata: drop update_host_id() function that does nothing now	2025-01-16 16:37:08 +02:00
Gleb Natapov	50fb22c8f9	locator: topology: drop indexing by ips Do not track id to ip mapping in the topology class any longer. There are no remaining users.	2025-01-16 16:37:08 +02:00
Gleb Natapov	f9df092fd1	repair: drop unneeded code There is code that creates a map from id to ip and then creates a vector from the keys of the map. Create a vector directly instead.	2025-01-16 16:37:08 +02:00
Gleb Natapov	12da203cae	storage_service: use host_id to look for a node in on_alive handler	2025-01-16 16:37:08 +02:00
Gleb Natapov	d45ce6fa12	storage_proxy: translate ips to ids in forward array using gossiper We already use it to translate reply_to, so do it for consistency and to drop ip based API usage.	2025-01-16 16:37:08 +02:00
Gleb Natapov	db73758655	locator: topology: remove unused functions	2025-01-16 16:37:07 +02:00
Gleb Natapov	fb28ff5176	storage_service: check for outdated ip in on_change notification in the peers table The code checks that it does not run for an ip address that is no longer in use (after ip address change). To check that we can use peers table and see if the host id is mapped to the address. If yes, this is the latest address for this host id otherwise this is an outdated entry.	2025-01-16 16:37:07 +02:00
Gleb Natapov	163099678e	storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology We want to drop ip from the locator::node.	2025-01-16 16:37:07 +02:00
Gleb Natapov	49fa1130ef	topology coordinator: change connection dropping code to work on host ids Do not use ip from topology::node, but look it up in address map instead. We want to drop ip from the topology::node.	2025-01-16 16:37:07 +02:00
Gleb Natapov	83d15b8e32	cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query We want to drop ip from the topology::node.	2025-01-16 16:37:07 +02:00
Gleb Natapov	5cd3627baa	locator: drop unused function from tablet_effective_replication_map	2025-01-16 16:37:07 +02:00
Gleb Natapov	122d58b4ad	api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead	2025-01-16 16:37:07 +02:00
Gleb Natapov	97f95f1dbd	locator: token_metadata: remove unused ip based functions	2025-01-16 16:37:07 +02:00
Gleb Natapov	3068e38baa	locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs	2025-01-16 16:37:07 +02:00
Gleb Natapov	0ec9f7de64	gossiper: drop get_unreachable_token_owners functions It is used by truncate code only and even there it only checks if the returned set is not empty. Check for dead token owners in the truncation code directly.	2025-01-16 16:37:07 +02:00
Gleb Natapov	a7a7cdcf42	storage_service: use gossiper to map ip to id in node_ops operations Replace operation is special though. In case of replacing with the same IP the gossiper will not have the mapping, and node_ops RPC unfortunately does not send host id of a replaced node. For replace we consult peers table instead to find the old owner of the IP. A node that is replacing (the coordinator of the replace) will not have it though, but luckily it is not needed since it updates metadata during join_topology() anyway. The only thing that is missing there is add_replacing_endpoint() call which the patch adds.	2025-01-16 16:37:07 +02:00
Gleb Natapov	0db6136fa5	storage_service: fix indentation after the last patch	2025-01-16 16:37:07 +02:00
Gleb Natapov	9197b88e48	storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node The call already throws an error if there is more than one. Throw if there are zero as well and drop the loops.	2025-01-16 16:37:07 +02:00
Gleb Natapov	fcfd005023	token_metadata: drop no longer used functions	2025-01-16 16:37:07 +02:00
Gleb Natapov	7c4c485651	host_id_or_endpoint: use gossiper to resolve ip to id and back mappings host_id_or_endpoint is a helper class that holds either id or ip and translates one into another on demand. Use gossiper to do a translation there instead of token_metadata since we want to drop ip based APIs from the latter.	2025-01-16 16:37:07 +02:00
Gleb Natapov	70cc014307	storage_service: ip_address_updater: check peers table instead of token_metadata whether ip was changed As part of changing IP address peers table is updated. If it has a new address the update can be skipped.	2025-01-16 16:37:07 +02:00
Gleb Natapov	8e55cc6c78	storage_service: fix logging When logger outputs a range it already does join, so no other join is needed.	2025-01-16 16:37:07 +02:00
Gleb Natapov	7556e3d045	topology coordinator: remove gossiper entry only if host id matches provided one Currently the entry is removed only if ip is not used by any normal or transitioning node. This is done to not remove a wrong entry that just happens to use the same ip, but the same can be achieved by checking host id in the entry.	2025-01-16 16:37:07 +02:00
Gleb Natapov	593308a051	node_ops, cdc: drop remaining token_metadata::get_endpoint_for_host_id() usage Use address map to translate id to ip instead. We want to drop ips from token_metadata.	2025-01-16 16:37:07 +02:00
Gleb Natapov	ae8dc595e1	hints: move id to ip translation into store_hint() function Also use gossiper to translate instead of token_metadata since we want to get rid of ip based APIs there.	2025-01-16 16:37:06 +02:00
Gleb Natapov	c7d08fe1fe	storage_service: change get_dc_rack_for() to work on host ids	2025-01-16 16:37:06 +02:00
Gleb Natapov	415e8de36e	locator: topology: change get_datacenter_endpoints and get_datacenter_racks to return host ids and amend users	2025-01-16 16:37:06 +02:00
Gleb Natapov	8a0fea5fef	locator: topology: drop is_me ip overload along with remaining users	2025-01-16 16:37:06 +02:00
Gleb Natapov	2ea8df2cf5	storage_proxy: drop is_alive that works on ip since it is not used any more	2025-01-16 16:37:06 +02:00
Gleb Natapov	8433947932	locator: topology: remove get_location overload that works on ip and its last users	2025-01-16 16:37:06 +02:00
Gleb Natapov	25eb98ecbc	locator: topology: drop no longer used ip based overloads	2025-01-16 16:37:06 +02:00
Gleb Natapov	315db647dd	consistency_level: drop templates since the same types of ranges are used by all the callers	2025-01-16 16:37:06 +02:00
Gleb Natapov	1b6e1456e5	messaging_service: drop the usage of ip based token_metadata APIs We want to drop ips from token_metadata so move to use host id based counterparts. Messaging service gets a function that maps from ips to id when it starts listening.	2025-01-16 16:37:06 +02:00
Gleb Natapov	da9b7b2626	storage_service: drop ip based topology::get_datacenter() usage We want to drop ips from the topology eventually.	2025-01-16 16:37:06 +02:00
Gleb Natapov	36ccc897e8	gossiper: change get_live_members and all its users to work on host ids	2025-01-16 16:37:06 +02:00
Gleb Natapov	7a3237c687	messaging_service: drop get_raw_version and knows_version They are unused. The version is always fixed.	2025-01-16 16:37:06 +02:00
Gleb Natapov	8cc09f4358	storage_service: do not use ip addresses from token_metadata in handling of a normal state Instead use gossiper and peers table to retrieve same information. Token_metadata is created from the mix of those two anyway. The goal is to drop ips from token_metadata entirely.	2025-01-16 16:37:06 +02:00
Gleb Natapov	5262bbafff	locator: drop no longer used ip based functions	2025-01-16 16:37:06 +02:00
Gleb Natapov	542360e825	test: drop inet_address usage from network_topology_strategy_test Move the test to work on host ids. IPs will be dropped eventually.	2025-01-16 16:37:06 +02:00
Gleb Natapov	9ea53a8656	storage_service: move describe ring and get_range_to_endpoint_map to use host ids inside and translate to ips at the last moment The functions are called from the RESTful API, so they have to return ips for backwards compatibility, but internally we can use host ids as long as possible and convert to ips just before returning. This also drops usage of ip based erm function which we want to get rid of.	2025-01-16 16:37:06 +02:00
Gleb Natapov	f03a575f3d	storage_service: move storage_service::get_natural_endpoints to use host ids internally and translate to ips before returning The function is called by the RESTful API, so it has to return ips for backwards compatibility, but internally we can use host ids as long as possible and convert to ips just before returning. This also drops usage of ip based erm function which we want to get rid of.	2025-01-16 16:37:06 +02:00
Gleb Natapov	6e6b2cfa63	storage_service: use existing util function instead of re-implementing it locator/util.hh already has get_range_to_address_map which is exactly like the one in the storage_service. So remove the latter and use the former instead.	2025-01-16 16:37:06 +02:00
Gleb Natapov	58f8395bc2	storage_service: use gossiper instead of token_metadata to map ip to id in gossiper notifications We want to drop ips from token_metadata so move to different API to map ip to id.	2025-01-16 16:35:13 +02:00
Michał Chojnowski	16b3352ae7	build: fix -ffile-prefix-map cmake doesn't set a `-ffile-prefix-map` for source files. Among other things, this results in absolute paths in Scylla logs: ``` Jan 11 09:59:11.462214 longevity-tls-50gb-3d-master-db-node-2dcd4a4a-5 scylla[16339]: scylla: /jenkins/workspace/scylla-master/next/scylla/utils/refcounted.hh:23: utils::refcounted::~refcounted(): Assertion `_count == 0' failed. ``` And it results in absolute paths in gdb, which makes it a hassle to get gdb to display source code during debugging. (A build-specific `substitute-path` has to be configured for that). There is a `-ffile-prefix-map` rule for `CMAKE_BINARY_DIR`, but it's wrong. Patch `dbb056f4f7`, which added it, was misguided. What we want is to strip the leading components of paths up to the repository directory, both in __FILE__ macros and in debug info. For example, we want to convert /home/michal/scylla/replica/table.cc to replica/table.cc or ./replica/table.cc, both in Scylla logs and in gdb. What the current rule does is map `/home/michal/scylla/build` to `.`, which is wrong: it doesn't do anything about the paths outside of `build`, which are the ones we actually care about. This patch fixes the problem. Closes scylladb/scylladb#22311	2025-01-16 16:35:18 +03:00
Kefu Chai	8d7786cb0e	build: cmake: use wasm32-wasip1 as an alternative of wasm32-wasi wasm32-wasi has been removed in Rust 1.84 (Jan 5th, 2025). if one compiles the tree with Rust 1.84 or up, following build failure is expected: ``` [2/305] Building WASM /home/kefu/dev/scylladb/build/wasm/return_input.wasm FAILED: wasm/return_input.wasm /home/kefu/dev/scylladb/build/wasm/return_input.wasm cd /home/kefu/dev/scylladb/test/resource/wasm/rust && /usr/bin/cargo build --target=wasm32-wasi --example=return_input --locked --manifest-path=Cargo.toml --target-dir=/home/kefu/dev/scylladb/build/test/resource/wasm/rust && wasm-opt /home/kefu/dev/scylladb/build/test/resource/wasm/rust/wasm32-wasi//debug/examples/return_input.wasm -Oz -o /home/kefu/dev/scylladb/build/wasm/return_input.wasm && wasm-strip /home/kefu/dev/scylladb/build/wasm/return_input.wasm error: failed to run `rustc` to learn about target-specific information Caused by: process didn't exit successfully: `rustc - --crate-name ___ --print=file-names --target wasm32-wasi --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=split-debuginfo --print=crate-name --print=cfg` (exit status: 1) --- stderr error: Error loading target specification: Could not find specification for target "wasm32-wasi". Run `rustc --print target-list` for a list of built-in targets ``` in order to workaround this issue, let's check for supported target, and use wasm32-wasip1 if wasm32-wasi is not listed as the supported target. Refs #20878 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22320	2025-01-16 16:28:29 +03:00
Michał Chojnowski	38d94475f2	messaging_service: fix the piece of code which clears clients on shutdown() While this isn't strictly needed for anything, messaging_service is supposed to clear its RPC connection objects on stop, for debuggability reasons. But a recent change in this area broke that. std::bind creates copies of its arguments, so the `m.clear()` statement in stop_client() only clears a copy of the vector of shared pointers, instead of clearing the original vector. This patch fixes that. Fixes #22245 Closes scylladb/scylladb#22333	2025-01-16 16:26:18 +03:00
Andrei Chekun	29a69f495e	test.py: Mark the cluster dirty after each test for topology Currently, tests reuse the cluster. This leads to a situation where a test passes but leaves the cluster broken, so the next tests try to clean up the Scylla working directory while starting the node. The start timeout is two minutes by default, and cleaning up the mess after several tests can sometimes take longer, so tests fail while adding the node to the cluster. This PR marks the cluster dirty after the test, so there is no need to clean the Scylla working directory. The disadvantage of this approach is longer test execution time. The observable increase is approximately one minute for one repeat in dev mode: 22 min 35s vs. 23 min 41s. Closes scylladb/scylladb#22274	2025-01-16 13:51:18 +01:00
Botond Dénes	b2a03e03f7	Merge 'raft: Handle non-critical config update errors in when changing voter status.' from Sergey Zolotukhin When a node is bootstrapped, joins a cluster as a non-voter, and changes its role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried. Fixes scylladb/scylladb#20814 Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2. Closes scylladb/scylladb#22253 * github.com:scylladb/scylladb: raft: refactor `remove_from_raft_config` to use a timed `modify_config` call. raft: Refactor functions using `modify_config` to use a common wrapper for retrying. raft: Handle non-critical config update errors in when changing status to voter. test: Add test to check that a node does not fail on unknown commit status error when starting up. raft: Add run_op_with_retry in raft_group0.	2025-01-16 11:00:47 +02:00
Yaron Kaikov	f15bf8a245	Update ScyllaDB version to: 2025.1.0-dev Following the license changes in `f3eade2f62` Closes scylladb/scylladb#21978	2025-01-16 07:01:37 +02:00
Gleb Natapov	0c930199f8	storage_service: use gossiper to map id to ip instead of token_metadata in node_ops_cmd_handler We want to drop ips from token_metadata so move to different API to map ip to id.	2025-01-15 16:30:29 +02:00
Gleb Natapov	5d4d9fd31d	storage_service: force_remove_completion use address map to resolve id to ip instead of token metadata We want to drop ips from token_metadata so move to different API to map ip to id.	2025-01-15 16:30:29 +02:00
Gleb Natapov	f5fa4d9742	topology coordinator: drop get_endpoint_for_host_id_if_known usage Now that we have gossiper::get_endpoint_state_ptr that works on host ids there is no need to translate id to ip at all.	2025-01-15 16:30:29 +02:00
Gleb Natapov	b3f8b579c0	gossiper: add get_endpoint_state_ptr() function that works on host id Will be used later to simplify code.	2025-01-15 16:30:29 +02:00
Gleb Natapov	448282dc93	storage_proxy: use gossiper to map ip to host id in connection_dropped callback We want to drop ips from token_metadata so move to different API to map ip to id.	2025-01-15 16:30:29 +02:00
Gleb Natapov	ae821ba07a	repair: use gossiper to map ip to host id instead of token_metadata We want to drop ips from token_metadata so move to different API to map ip to id.	2025-01-15 16:30:29 +02:00
Gleb Natapov	8c85350d4b	db/virtual_tables: use host id from the gossiper endpoint state in cluster_status table The state always has host id now, so there is no point in looking it up in the token metadata.	2025-01-15 16:30:28 +02:00
Gleb Natapov	844cb090bf	view: do not use get_endpoint_for_host_id_if_known to check if a node is part of the topology Check directly in the topology instead.	2025-01-15 16:30:28 +02:00
Gleb Natapov	f685c7d0af	hints: use gossiper to map ip to id in wait_for_sync_point We want to drop ips from token_metadata so move to different API to map ip to id.	2025-01-15 16:30:28 +02:00
Gleb Natapov	4d7c05ad82	hints: move create_hint_sync_point function to host ids One of its callers is in the RESTful API which gets ips from the user, so we convert ips to ids inside the API handler using gossiper before calling the function. We need to deprecate ip based API and move to host id based.	2025-01-15 16:30:28 +02:00
Gleb Natapov	755ee9a2c5	api: do not use token_metadata to retrieve ip to id mapping in token_metadata RESTful endpoints We want to drop ip knowledge from the token_metadata, so use gossiper to retrieve the mapping instead.	2025-01-15 16:30:28 +02:00
Gleb Natapov	0d4d066fe3	hints: simplify can_send() function Since there is gossiper::is_alive version that works on host_id now there is no need to convert _ep_key to ip which simplifies the code a lot.	2025-01-15 16:30:28 +02:00
Gleb Natapov	50ee962033	service: address_map: add lookup function that expects address to exist We will add code that expects id to ip mapping to exist. If it does not it is better to fail earlier during testing, so add a function that calls internal error in case there is no mapping.	2025-01-15 16:30:28 +02:00
Paweł Zakrzewski	5b1da31595	audit: Add shares support to service level management Introduces shares-based workload prioritization for service levels, allowing fine-grained control over resource allocation between tenants. Key changes: - Add shares option to service level configuration: - Valid range: 1-1000 shares - Default value: 1000 shares - Enterprise-only feature gated by WORKLOAD_PRIORITIZATION feature flag - Extend CQL interface: - Add shares parameter to CREATE/ALTER SERVICE_LEVEL - Add shares column to system_distributed.service_levels - Add percentage calculation to LIST SERVICE_LEVELS - Add shares to DESCRIBE EFFECTIVE SERVICE_LEVEL output - Add validation: - Enforce shares range (1-1000) - Validate enterprise feature flag - Handle unset/delete markers properly - Update service level statements: - Add shares validation to CREATE/ALTER operations - Preserve shares through default value replacement - Add proper decomposition for shares values in result sets This change enables operators to control relative resource allocation between tenants using proportional share scheduling, while maintaining backward compatibility with existing service level configurations.	2025-01-15 15:01:05 +01:00
Botond Dénes	25fbe488ef	Merge 'view_builder: write status to tables before starting to build' from Michael Litvak When adding a new view for building, first write the status to the system tables and then add the view building step that will start building it. Otherwise, if we start building it before the status is written to the table, it may happen that we complete building the view, write the SUCCESS status, and then overwrite it with the STARTED status. The view_build_status table will remain in an incorrect state indicating the view building is not complete. Fixes #20638 The PR contains a few additional small fixes in separate commits related to the view build status table. It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. The table may be in an incorrect state due to these issues, having a row with status STARTED when it actually finished building the view, which will cause us to wait in `wait_for_view` until it times out. For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048. Backport to fix the bugs that affect previous versions and improve CI stability Closes scylladb/scylladb#22307 * github.com:scylladb/scylladb: view_builder: hold semaphore during entire startup view_builder: pass view name by value to write_view_build_status view_builder: write status to tables before starting to build	2025-01-15 15:01:17 +02:00
Calle Wilund	48fda00f12	tools: Add standard extensions and propagate to schema load Fixes #22314 Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.	2025-01-15 12:10:23 +00:00
Calle Wilund	00b40eada3	cql_test_env: Use add all extensions instead of individually	2025-01-15 12:08:09 +00:00
Calle Wilund	4aaf3df45e	main: Move extensions adding to function Easily called from elsewhere. The extensions we should always include (oxymoron?)	2025-01-15 12:07:39 +00:00
Calle Wilund	e6aa09e319	tombstone_gc: Make validate work for tools Don't crash if validation is done as part of loading a schema from file (schema.cql)	2025-01-15 12:06:02 +00:00
Paweł Zakrzewski	28bd699c51	audit: Add service level support to CQL login process This change integrates service level functionality into the CQL authentication and connection handling: - Add scheduling_group_name to client_data to track service level assignments - Extend SASL challenge interface to expose authenticated username - Modify connection processing to support tenant switching: - Add switch_tenant() method to handle scheduling group changes - Add process_until_tenant_switch() to handle request processing boundaries - Implement no_tenant() default executor - Add execute_under_tenant_type for scheduling group management - Update connection lifecycle to properly handle service level changes: - Initialize connections with default scheduling group - Support dynamic scheduling group updates when service levels change - Ensure proper cleanup of scheduling group assignments The changes enable proper scheduling group assignment and management based on authenticated users' service levels, while maintaining backward compatibility for connections without service level assignments.	2025-01-15 11:10:36 +01:00
Paweł Zakrzewski	98f5e49ea8	audit: Add support to CQL statements Integrates audit functionality into CQL statement processing to enable tracking of database operations. Key changes: - Add audit_info and statement_category to all CQL statements - Implement audit categories for different statement types: - DDL: Schema altering statements (CREATE/ALTER/DROP) - DML: Data manipulation (INSERT/UPDATE/DELETE/TRUNCATE/USE) - DCL: Access control (GRANT/REVOKE/CREATE ROLE) - QUERY: SELECT statements - ADMIN: Service level operations - Add audit inspection points in query processing: - Before statement execution - After access checks - After statement completion - On execution failures - Add password sanitization for role management statements - Mask plaintext passwords in audit logs - Handle both direct password parameters and options maps - Preserve query structure while hiding sensitive data - Modify prepared statement lifecycle to carry audit context - Pass audit info during statement preparation - Track audit info through statement execution - Support batch statement auditing This change enables comprehensive auditing of CQL operations while ensuring sensitive data is properly masked in audit logs.	2025-01-15 11:10:36 +01:00
Paweł Zakrzewski	1810e2e424	audit: Integrate audit subsystem into Scylla main process Adds core integration of the audit subsystem into Scylla's main process flow. Changes include: - Import audit subsystem header - Initialize audit system during server startup using configuration and token metadata - Start audit system after API server initialization with query processor and memory manager - Add proper shutdown sequence for audit system using RAII pattern - Add error handling for audit system initialization failures The audit system is now properly integrated into Scylla's lifecycle, ensuring: - Correct initialization order relative to other subsystems - Proper resource cleanup during shutdown - Graceful error handling for initialization failures	2025-01-15 11:10:36 +01:00
Paweł Zakrzewski	702e727e33	audit: Add documentation for the audit subsystem Adds detailed documentation covering the new audit subsystem: - Add new audit.md design document explaining: - Core concepts and design decisions - CQL extensions for audit management - Implementation details and trigger evaluation - Prior art references from other databases - Add user-facing documentation: - New auditing.rst guide with configuration and usage details - Integration with security documentation index - Updates to cluster management procedures - Updates to security checklist The documentation covers all aspects of the audit system including: - Configuration options and storage backends (syslog/table) - Audit categories (DCL/DDL/AUTH/DML/QUERY/ADMIN) - Permission model and security considerations - Failure handling and logging - Example configurations and output formats This ensures users have complete guidance for setting up and using the new audit capabilities.	2025-01-15 11:10:35 +01:00
Paweł Zakrzewski	384641194a	audit: Add the audit subsystem This change introduces a new audit subsystem that allows tracking and logging of database operations for security and compliance purposes. Key features include: - Configurable audit logging to either syslog or a dedicated system table (audit.audit_log) - Selective auditing based on: - Operation categories (QUERY, DML, DDL, DCL, AUTH, ADMIN) - Specific keyspaces - Specific tables - New configuration options: - audit: Controls audit destination (none/syslog/table) - audit_categories: Comma-separated list of operation categories to audit - audit_tables: Specific tables to audit - audit_keyspaces: Specific keyspaces to audit - audit_unix_socket_path: Path for syslog socket - audit_syslog_write_buffer_size: Buffer size for syslog writes The audit logs capture details including: - Operation timestamp - Node and client IP addresses - Operation category and query - Username - Success/failure status - Affected keyspace and table names	2025-01-15 11:10:35 +01:00
Piotr Dulikowski	72f28ce81e	Merge 'main, view: Pair view builder drain with its start' from Dawid Mędrek In this PR, we pair draining the view builder with its start. To better understand what was done and why, let's first look at the situation before this commit and the context of it: (a) The following things happened in order: 1. The view builder would be constructed. 2. Right after that, a deferred lambda would be created to stop the view builder during shutdown. 3. group0_service would be started. 4. A deferred lambda stopping group0_service would be created right after that. 5. The view builder would be started. (b) Because the view builder depends on group0_client, it couldn't be started before starting group0_service. On the other hand, other services depend on the view builder, e.g. the stream manager. That makes changing the order of initialization a difficult problem, so we want to avoid doing that unless we're sure it's the right choice. (c) Since the view builder uses group0_client, there was a possibility of running into a segmentation fault issue in the following scenario: 1. A call to `view_builder::mark_view_build_success()` is issued. 2. We stop group0_service. 3. `view_builder::mark_view_build_success()` calls `announce_with_raft()`, which leads to a use-after-free because group0_service has already been destroyed. This very scenario took place in scylladb/scylladb#20772. Initially, we decided to solve the issue by initializing group0_service a bit earlier (scylladb/scylladb@7bad8378c7). Unfortunately, it led to other issues described in scylladb/scylladb#21534, so we revert that patch. These changes are the second attempt to the problem where we want to solve it in a safer manner. The solution we came up with is to pair the start of the view builder with a deferred lambda that deinitializes it by calling `view_builder::drain()`. No other component of the system should be able to use the view builder anymore, so it's safe to do that. 
Furthermore, that pairing makes the analysis of initialization/deinitialization order much easier. We also solve the aforementioned use-after-free issue because the view builder itself will no longer attempt to use group0_client. Note that we still pair a deferred lambda calling `view_builder::stop()` with the construction of the view builder; that function will also call `view_builder::drain()`. Another notable thing is `view_builder::drain()` may be called earlier by `storage_service::do_drain()`. In other words, these changes cover the situation when Scylla runs into a problem when starting up. Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too. Fixes scylladb/scylladb#20772 Fixes scylladb/scylladb#21534 Closes scylladb/scylladb#21909 * github.com:scylladb/scylladb: test/topology_custom: Add test for Scylla with disabled view building main, view: Pair view builder drain with its start Revert "main,cql_test_env: start group0_service before view_builder"	2025-01-15 09:50:26 +01:00
Sergey Zolotukhin	228a66d030	raft: refactor `remove_from_raft_config` to use a timed `modify_config` call. To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call. This ensures the operation doesn't get stuck indefinitely.	2025-01-15 09:49:17 +01:00
Sergey Zolotukhin	3da4848810	raft: Refactor functions using `modify_config` to use a common wrapper for retrying. There are several places in `raft_group0` where almost identical code is used for retrying `modify_config` in case of `commit_status_unknown` error. To avoid code duplication all these places were changed to use a new wrapper `run_op_with_retry`.	2025-01-15 09:49:17 +01:00
Sergey Zolotukhin	8c48f7ad62	raft: Handle non-critical config update errors in when changing status to voter. When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried. Fixes scylladb/scylladb#20814	2025-01-15 09:49:15 +01:00
Takuya ASADA	f2a53d6a2c	dist: make p11-kit-trust.so able to work in relocatable package Currently, our relocatable package doesn't contain p11-kit-trust.so since it is dynamically loaded and does not show up in "ldd" results (the relocatable packaging script finds dependent libraries with "ldd"). So we need to add it in create-relocatable-package.py. Also, we have two more problems: 1. The p11 module load path is defined as "/usr/lib64/pkcs11", not referencing /opt/scylladb/libreloc (and RedHat variants use a different path than Debian variants) 2. The ca-trust-source path is configured at build time (on Fedora); it is compatible with RedHat variants but not with Debian variants To solve these problems, we need to override the default p11-kit configuration. To do so, we need to add a configuration file at /opt/scylladb/share/pkcs11/modules/p11-kit-trust.module. Also, since p11-kit of course doesn't reference /opt/scylladb by default, we need to override the load path with p11_kit_override_system_files(). In the configuration file, we can specify the module load path with "modules: <path>", and we can also specify the ca-trust-source path with "x-init-reserved: paths=<path>". Fixes scylladb/scylladb#13904 Closes scylladb/scylladb#22302	2025-01-15 10:09:17 +02:00
Kefu Chai	0d399702c7	api: include used header when building the tree on fedora 41, we could have following build failure: ``` FAILED: api/CMakeFiles/api.dir/Debug/system.cc.o /usr/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/Debug/seastar/gen/include -isystem /home/kefu/dev/scylladb/build/rust -isystem /home/kefu/dev/scylladb/abseil -I/usr/include/p11-kit-1 -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-explicit-specialization-storage-class -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. 
-march=westmere -Xclang -fexperimental-assignment-tracking=disabled -std=gnu++23 -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_DEBUG -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEBUG_PROMISE -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_TYPE_ERASE_MORE -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -DWITH_GZFILEOP -MD -MT api/CMakeFiles/api.dir/Debug/system.cc.o -MF api/CMakeFiles/api.dir/Debug/system.cc.o.d -o api/CMakeFiles/api.dir/Debug/system.cc.o -c /home/kefu/dev/scylladb/api/system.cc /home/kefu/dev/scylladb/api/system.cc:116:47: error: no member named 'lexical_cast' in namespace 'boost' 116 \| logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level"))); \| ~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:116:78: error: expected '(' for function-style cast or type construction 116 \| logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level"))); \| ~~~~~~~~~~~~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:118:25: error: no type named 'bad_lexical_cast' in namespace 'boost' 118 \| } catch (boost::bad_lexical_cast& e) { \| ~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:136:47: error: no member named 'lexical_cast' in namespace 'boost' 136 \| logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level"))); \| ~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:136:78: error: expected '(' for function-style cast or type construction 136 \| logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level"))); \| ~~~~~~~~~~~~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:140:25: error: 
no type named 'bad_lexical_cast' in namespace 'boost' 140 \| } catch (boost::bad_lexical_cast& e) { \| ~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:148:47: error: no member named 'lexical_cast' in namespace 'boost' 148 \| logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level"))); \| ~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:148:78: error: expected '(' for function-style cast or type construction 148 \| logging::log_level level = boost::lexical_cast<logging::log_level>(std::string(req.get_query_param("level"))); \| ~~~~~~~~~~~~~~~~~~^ /home/kefu/dev/scylladb/api/system.cc:150:25: error: no type named 'bad_lexical_cast' in namespace 'boost' 150 \| } catch (boost::bad_lexical_cast& e) { \| ~~~~~~~^ ``` in this change, we include the used header to address the build failure. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22303	2025-01-15 10:11:40 +03:00
Jenkins Promoter	f5f15c6d07	Update pgo profiles - aarch64	2025-01-15 04:49:45 +02:00
Jenkins Promoter	f021e16d0c	Update pgo profiles - x86_64	2025-01-15 04:26:42 +02:00
Sergey Zolotukhin	16053a86f0	test: Add test to check that a node does not fail on unknown commit status error when starting up. Test that a node is starting successfully if while joining a cluster and becoming a voter, it receives an unknown commit status error. Test for scylladb/scylladb#20814	2025-01-14 17:12:06 +01:00
Sergey Zolotukhin	775411ac56	raft: Add run_op_with_retry in raft_group0. Calls to `modify_config` often need to be retried, so to avoid code duplication, a function wrapper that allows a function to be called with automatic retries in case of failures was added.	2025-01-14 17:12:04 +01:00
Kamil Braun	2eac7a2d61	Merge 'test/pylib: two trivial cleanups' from Kefu Chai - use "foo not in bar" instead of "not foo in bar" - test/pylib: use foo instead of `'{}'.format(foo)` --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#22066 * github.com:scylladb/scylladb: test/pylib: use `foo` instead of `'{}'.format(foo)` test/pylib: use "foo not in bar" instead of "not foo in bar"	2025-01-14 16:27:44 +01:00
Nadav Har'El	15c252fd8f	Merge 'docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD' from Dawid Mędrek As part of #18750, we added a CQL statement CREATE ROLE WITH SALTED HASH that prevented hashing a password when creating a role, effectively leading to inserting a hash given by the user directly into the database. In #21350, we noticed that Cassandra had implemented a CQL statement of similar semantics but different syntax. We decided to rename Scylla's statement to be compatible with Cassandra. Unfortunately, we didn't notice one more difference between what we had in Scylla and what was part of Cassandra. Scylla's statement was originally supposed to only be used when restoring the schema and the user needn't be aware of its existence at all: the database produced a sequence of CQL statements that the user saved to a file and when a need to restore the schema arose, they would execute the contents of the file. That's why, although we documented the feature, it was only done in the necessary places. Those that weren't related to the backup & restore procedure were deliberately skipped. Cassandra, on the other hand, added the statement for a different purpose (for details, see the relevant issue) and it was supposed to be used by the user by design. The statement is also documented as such. Since we want to preserve compatibility with Cassandra, we document the statement and its semantics in the user documentation, explicitly implying that it can be used by the user. We also add a test verifying that logging in works correctly. Fixes scylladb/scylladb#21691 Backport: not needed. The relevant code didn't make it to 6.2 or any previous version of OSS. Closes scylladb/scylladb#21752 * github.com:scylladb/scylladb: docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD test/boost: Add test for creating roles with hashed passwords	2025-01-14 15:33:30 +02:00
Kefu Chai	3b7a991f74	ent/encryption: rename "sie" to "get_opt" "sie" is the short for "system info encryption". it is a wrapper around a `opts` map so we can get the individual option by providing a default value via an `optional<>` return value. but "sie" could be difficult to understand without more context. and it is used like a function -- we get the individual option using its operator(). so, in order to improve the readability, in this change, we rename it to "get_opt". Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 21:08:17 +08:00
Kefu Chai	92c6c8a32f	ent,main: fix misspellings these misspellings are identified by codespell. they are either in comment or logging messages. let's fix them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 21:08:17 +08:00
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Kefu Chai	e6b05cb9ea	.github: use the toolchain specified by tools/toolchain/image Previously, we hardwired the container to a previous frozen toolchain image. but at the time of writing, the tree does not compile in the specified toolchain image anymore, after the required building environment is updated, and toolchain was updated accordingly. in order to improve the maintainability, let's reuse `read-toolchain.yaml` job which reads `tools/toolchain/image`, so we don't have to hardwire the container used for building the tree with the latest seastar. this should address the build failure surfaced recently. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22287	2025-01-14 07:56:38 -05:00
Kefu Chai	f8885a4afd	dist/docker,docs: replace "--experimental" with "--experimental-features" The "--experimental" option was removed in commit `f6cca741ea`. Using this deprecated option now causes Scylla to fail with the error: ``` error: the argument ('on') for option '--experimental-features' is invalid ``` So, in this change, let's update the docker entry point script to use `--experimental-features` command line option instead. The related document is updated accordingly. Fixes scylladb/scylladb#22207 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22283	2025-01-14 07:56:38 -05:00
Aleksandra Martyniuk	592512fd0f	test: fix memtable_flush_period test memtable_flush_period test sets the flush period to 200ms and checks whether the data is flushed after 500ms. When flush period is set, the timer is armed with the given value. On expiration, memtables are flushed and then the timer is rearmed. There is no certainty that during 500ms the flush finishes, though. Check if after 500ms flush has started. Wait until there is an sstable. Fixes: #21965. Closes scylladb/scylladb#22162	2025-01-14 07:56:38 -05:00
Aleksandra Martyniuk	32ab58cdea	repair: add repair_service gate In main.cc storage_service is started before and stopped after repair_service. storage_service keeps a reference to sharded repair_service and calls its methods, but nothing ensures that repair_service's local instance would be alive for the whole execution of the method. Add a gate to repair_service and enter it in storage_service before executing methods on local instances of repair_service. Fixes: #21964. Closes scylladb/scylladb#22145	2025-01-14 07:56:38 -05:00
Geoff Montee	25e8478051	docs: rest.rst: use latest docker tag to view Swagger UI for REST API Closes scylladb/scylladb#21681	2025-01-14 07:56:38 -05:00
Botond Dénes	686a997c04	Merge 'Complete implementation of configuring IO bandwidth limits' from Pavel Emelyanov In Scylla there are two options that control IO bandwidth limit -- the /storage_service/(compaction\|stream)_throughput REST API endpoints. The endpoints are partially implemented and have no counterparts in the nodetool. This set implements the missing bits and adds tests for new functionality. Closes scylladb/scylladb#21877 * github.com:scylladb/scylladb: nodetool: Implement [gs]etstreamthroughput commands nodetool: Implement [gs]etcompationthroughput commands test: Add validation of how IO-updating endpoints work api: Implement /storage_service/(stream\|compaction)_throughput endpoints api: Disqualify const config reference api: Implement /storage_service/stream_throughput endpoint api: Move stream throughput set/get endpoints from storage service block api: Move set_compaction_throughput_mb_per_sec to config block util: Include fmt/ranges.h in config_file.hh	2025-01-14 07:56:38 -05:00
Aleksandra Martyniuk	94f4871352	test: start waiting for task before it gets aborted Ensure that the repair task was aborted after wait API acknowledged its existence. Fixes: #22011. Closes scylladb/scylladb#22012	2025-01-14 07:56:37 -05:00
Michael Litvak	7a6aec1a6c	view_builder: hold semaphore during entire startup Guard the whole view builder startup routine by holding the semaphore until it's done instead of releasing it early, so that it's not intercepted by migration notifications.	2025-01-14 12:31:29 +02:00
Michael Litvak	1104411f83	view_builder: pass view name by value to write_view_build_status The function write_view_build_status takes two lambda functions and chooses which of them to run depending on the upgrade state. It might run both of them. The parameters ks_name and view_name should be passed by value instead of by reference because they are moved inside each lambda function. Otherwise, if both lambdas are run, the second call operates on invalid values that were moved.	2025-01-14 12:31:29 +02:00
Michael Litvak	b1be2d3c41	view_builder: write status to tables before starting to build When adding a new view for building, first write the status to the system tables and then add the view building step that will start building it. Otherwise, if we start building it before the status is written to the table, it may happen that we complete building the view, write the SUCCESS status, and then overwrite it with the STARTED status. The view_build_status table will remain in incorrect state indicating the view building is not complete. Fixes scylladb/scylladb#20638	2025-01-14 12:31:20 +02:00
Asias He	cd96fb5a78	repair: Add repair_hosts_filter and repair_dcs_filter They will be useful for hosts and DCs selection for the repair scheduler. It is not implemented yet. Adding it earlier, so we do not need to change the system table later. Closes scylladb/scylladb#21985	2025-01-14 08:46:26 +02:00
Geoff Montee	c8ca2bd212	docs: operating-scylla/admin-tools/virtual-tables.rst: fix link to virtual tables Closes scylladb/scylladb#22198	2025-01-14 08:45:49 +02:00
Lakshmi Narayanan Sreethar	63100b34da	sstable_directory: do not load remote sstables in process_descriptor The sstable loader relied on the generation id to provide an efficient hint about the shard that owns an sstable. But, this hint was rendered ineffective with the introduction of UUID generation, as the shard id was no longer embedded in the generation id. This also became suboptimal with the introduction of tablets. Commit `0c77f77` addressed this issue by reading the minimum from disk to determine sstable ownership but this improvement was lost with commit `63f1969`, which optimistically assumed that hints would work most of the time, which isn't true. This commit restores that change - shard id of a table is deduced by reading minimally from disk and then the sstable is fully loaded only if it belongs to the local shard. This patch also adds a testcase to verify that sstables are loaded only in their respective shards. Fixes #21015 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-01-13 20:01:30 +05:30
Lakshmi Narayanan Sreethar	6e3ecc70a6	sstable_directory: update `load_sstable()` definition Updated `sstable_directory::load_sstable()` to directly accept `data_dictionary::storage_options` instead of a function that returns the same. This is required to ensure `process_descriptor()` loads the sstable only once in the right shard. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-01-13 20:00:29 +05:30
Nadav Har'El	321d0fd3b1	Merge 'Alternator: Add WCU suppport for update item' from Amnon Heiman This series adds WCU support for the Alternator update item. The motivation behind it is to have a rough estimation of what a similar operation would have taken from WCU perspective if used with DynamoDB. The calculation is done with minimal overhead as the prime objective; the results are values that are less than or equal to what they would have been in DynamoDB. New feature, no need to backport. Closes scylladb/scylladb#21999 * github.com:scylladb/scylladb: alternator/test_returnconsumedcapacity.py: update item alternator/executor.cc: Add WCU for update_item	2025-01-13 14:35:46 +02:00
Kamil Braun	48a4efba2f	Merge 'Fix possible data corruption due to token keys clashing in read repair.' from Sergey Zolotukhin This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values. Fixes scylladb/scylladb#19101 Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2 Closes scylladb/scylladb#21996 * github.com:scylladb/scylladb: storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function. storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap. test: Add test case for checking read repair diff calculation when having conflicting keys.	2025-01-13 10:54:34 +01:00
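The key clash described in the read repair fix above can be illustrated with a small sketch. It is not Scylla's code: `toy_token` is a deliberately weak, hypothetical stand-in for the Murmur3-based token so that a collision is easy to produce, and the tuples stand in for mutations. It only shows why a map keyed by a hash of the partition key can silently drop an entry, while a map keyed by the partition key itself cannot.

```python
def toy_token(partition_key: bytes) -> int:
    # Hypothetical, deliberately weak hash (NOT Murmur3): byte sum modulo 8.
    return sum(partition_key) % 8


def index_by_token(mutations):
    # Keyed by token: two different partition keys with the same token
    # overwrite each other.
    return {toy_token(pk): value for pk, value in mutations}


def index_by_partition_key(mutations):
    # Keyed by the full partition key: distinct keys never clash.
    return {pk: value for pk, value in mutations}


# b"ab" and b"ba" have the same byte sum, so they share a toy token.
mutations = [(b"ab", "row-1"), (b"ba", "row-2")]
by_token = index_by_token(mutations)
by_key = index_by_partition_key(mutations)
```

With the two colliding keys, `by_token` ends up with a single entry while `by_key` keeps both.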
Kamil Braun	88a48f2355	Merge 'Load peers table into the gossiper on boot' from Gleb Since we manage ip to id mapping directly in gossiper now we need to load the mapping on boot. We already do it anyway, but only due to a bug which checks raft topology mode config before it is set, so the code thinks that it is in the gossiper mode and loads peers table into the gossiper and token metadata. Fix the bug and load peers into the gossiper only since token metadata is managed by raft. The series also removes address map related test that no longer checks anything and replace it with unit test. It also adds the dc/rack check to "join node" rpc. The check is done during shadow round now, but for it to work it requires dc/rack to be propagated through the gossiper and we want to eventually drop it. Ref: scylladb/scylladb#21777 * 'load-peers' of https://github.com/gleb-cloudius/scylla: topology coordinator: reject replace request if topology does not match gossiper: fix the logic of shadow_round parameter storage_service: do not add endpoint to the gossiper during topology loading. storage_service: load peers into gossiper on boot in raft topology mode storage_service: set raft topology change mode before using it in join_cluster locator: drop inet_address usage to figure out per dc/rack replication test: drop test_old_ip_notification_repro.py test: address_map: check generation handling during entry addition	2025-01-13 09:40:36 +01:00
Pavel Emelyanov	65f52db3a8	api: Hide parse_tables() helper It's no longer used outside of api/storage_service.cc file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:08 +03:00
Pavel Emelyanov	cf0dc8f90a	api: Use parse_table_infos() in stop_keyspace_compaction handler It now parses only table names from its "cf" argument. Parsing table_infos has two benefits -- it makes it possible to hide parse_tables() thus keeping less validation code around, and the subsequent db.find_column_family() call can avoid re-lookup of table uuid by its ks:table pair. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	fb09a645b8	api: Re-use parse_table_info() in column_family API Several places call parse_fully_qualified_cf_name() and get_uuid() helpers one after another. Previous patch introduced the parse_table_info() one that wraps both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	789f468f39	api: Make get_uuid() return table_info (and rename) The method gets "fully qualified" table name, which is 'ks:cf' string and returns back the resolved table_id value. Some callers will benefit from knowing the parsed 'cf' part of it (see next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	b55c05c9d0	api: Remove keyspace argument from for_table_on_all_shards() This argument is needed to find table by ks:cf pair. The "table" part is taken from the vector of table_info-s, but table_info-s have table_id value onboard, and the table can be found by this id. So keyspace is not needed any longer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	84ad9fe82b	api: Switch for_table_on_all_shards() to use table_info-s All callers of it already have one. Next patch will make even more use of those passed table_info-s. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	87cdf25891	api: Hide validate_table() helper It's no longer used outside of api/storage_service.cc. It's not yet possible to remove it completely, but it's better not to encourage others to use it outside of its current .cc file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	5a038fba39	api: Tables vector is never empty now in for_table_on_all_shards() Callers of this method provide vectors of two kinds: - explicitly single-entry one from endpoints that work on single table - vector returned by parse_table_infos() The latter helper, if it gets empty list of tables from user, populates its return value with all tables from the given keyspace. The removed check became obsolete after recent changes. Prior to those, the 2nd case provided vector from another helper called parse_tables(), which could return empty result. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	2016b12252	api: Move vectors of tables, not copy The set_tables_...() helper called here accept vector by value, so the existing code copies it. It's better to move, all the more so next changes will make this place pass vectors with more data onboard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	bf715ca614	api: Add table validation to set_compaction_strategy_class endpoint This handler doesn't check if the requested table exists. If it doesn't it will throw later anyway, but most of other endpoints that work with tables check table early. This early check allows throwing bad-param exception on missing table, not internal-server-error one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	e35245de36	api: Use get_uuid() to validate_table() in column family API This helper returns uuid, but also "Validates" the table exists by calling db.find_uuid() and throwing bad_param exception on error. This change will allow making for_table_on_all_shards() smaller a bit later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Pavel Emelyanov	6ab5bade21	api: Use parse_table_infos() in column family API The one is the same as parse_tables(), but returns back name:id pairs. This change will allow making for_table_on_all_shards() smaller a bit later, as well as removing the parse_tables() code eventually. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-13 11:32:07 +03:00
Andrei Chekun	2aea2610e0	test.py: Wait for tasks finish before going further Developers using asyncio.gather() often assume that it waits for all the futures (awaitables) given. But this isn't true when the return_exceptions parameter is False, which is the default. In that case, as soon as one future completes with an exception, the gather() call will propagate this exception immediately, and some of the unfinished tasks may continue to run in the background. This is bad for applications that use gather() to ensure that a list of background tasks has all completed. So such applications must use asyncio.gather() with return_exceptions=True, to wait for all given futures to complete either successfully or unsuccessfully. Closes scylladb/scylladb#22252	2025-01-13 09:43:28 +02:00
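The `asyncio.gather()` behaviour described in the commit above can be reproduced with a short sketch. The worker names and delays are made up for illustration and are not taken from test.py.

```python
import asyncio


async def worker(name, delay, fail, finished):
    # Sleep, then either fail or record that this worker completed.
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    finished.append(name)


async def gather_default():
    # Default return_exceptions=False: gather() raises as soon as the
    # fast worker fails, while the slow one is still running.
    finished = []
    slow = asyncio.ensure_future(worker("slow", 0.2, False, finished))
    fast = asyncio.ensure_future(worker("fast-failing", 0.01, True, finished))
    try:
        await asyncio.gather(fast, slow)
    except RuntimeError:
        pass
    still_running = not slow.done()
    await slow  # let the leftover task finish so nothing leaks
    return still_running


async def gather_wait_all():
    # return_exceptions=True: gather() waits for every awaitable and
    # returns exceptions as results instead of raising them.
    finished = []
    results = await asyncio.gather(
        worker("slow", 0.2, False, finished),
        worker("fast-failing", 0.01, True, finished),
        return_exceptions=True,
    )
    return finished, results
```

In the first variant the slow task is still pending when `gather()` raises. In the second, the caller resumes only after both workers are done, and can inspect the results list for exception instances.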
Botond Dénes	f899f0e411	tools/scylla-sstable: dump-statistics: fix handling of {min,max}_column_names Said fields in statistics are of type `disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled as array of regular strings. However these fields store exploded clustering keys, so the elements store binary data and converting to string can yield invalid UTF-8 characters that certain JSON parsers (jq, or python's json) can choke on. Fix this by treating them as binary and using `to_hex()` to convert them to string. This requires some massaging of the json_dumper: passing field offset to all visit() methods and using a caller-provided disk-string to sstring converter to convert disk strings to sstring, so in the case of statistics, these fields can be intercepted and properly handled. While at it, the type of these fields is also fixed in the documentation. Before: "min_column_names": [ "��Z��\u0011�\u0012ŷ4^��<", "�2y\u0000�}\u007f" ], "max_column_names": [ "��Z��\u0011�\u0012ŷ4^��<", "}��B\u0019l%^" ], After: "min_column_names": [ "9dd55a92bc8811ef12c5b7345eadf73c", "80327900e2827d7f" ], "max_column_names": [ "9dd55a92bc8811ef12c5b7345eadf73c", "7df79242196c255e" ], Fixes: #22078 Closes scylladb/scylladb#22225	2025-01-13 09:19:04 +03:00
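The problem and the fix in the dump-statistics commit above can be sketched as follows. This is only an analogy, not the tool's C++ code: Python's `bytes.hex()` is assumed to play the role of `to_hex()`, and `dump_column_names` is a hypothetical helper. The sample bytes are taken from the "After" output quoted in the commit message.

```python
import json


def dump_column_names(raw_names):
    # Hex-encode each binary element so the resulting JSON is plain ASCII.
    return json.dumps([name.hex() for name in raw_names])


raw = [bytes.fromhex("80327900e2827d7f")]

# Treating the element as text fails: 0x80 is not a valid UTF-8 start byte.
try:
    raw[0].decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

dumped = dump_column_names(raw)
```

Here `dumped` is a JSON array of hex strings that any JSON parser accepts, and the original bytes can be recovered with `bytes.fromhex()`.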
Botond Dénes	a21ecc3253	tools/scylla-sstable: also try reading scylla.yaml from /etc/scylla scylla-sstable tries to read scylla.yaml via the following sequence: 1) Use user-provided location if provided (--scylla-yaml-file parameter) 2) Use the environment variables SCYLLA_HOME and/or SCYLLA_CONF if set 3) Use the default location ./conf/scylla.yaml Step 3 is fine on dev machines, where the binaries are usually invoked from scylla.git, which does have conf/scylla.yaml, but it doesn't work on production machines, where the default location for scylla.yaml is /etc/scylla/scylla.yaml. To reduce friction when used on production machines, add another fallback in case (3) fails, which tries to read scylla.yaml from /etc/scylla/scylla.yaml location. Fixes: scylladb/scylladb#22202 Closes scylladb/scylladb#22241	2025-01-13 09:11:29 +03:00
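The lookup order from the commit above can be sketched roughly as below. This is not the tool's implementation. In particular, how `SCYLLA_CONF` and `SCYLLA_HOME` map to a path (a config directory, and a home directory containing `conf/`) is an assumption, as are the helper names and the choice not to fall through when an explicit or environment-provided location is missing.

```python
import os
from pathlib import Path


def candidate_paths(cli_path=None, env=None):
    # Returns the locations to try, in order. Env-variable semantics here
    # are assumed for illustration, not confirmed by the commit message.
    env = os.environ if env is None else env
    if cli_path:                      # 1) --scylla-yaml-file
        return [Path(cli_path)]
    if "SCYLLA_CONF" in env:          # 2) environment variables
        return [Path(env["SCYLLA_CONF"]) / "scylla.yaml"]
    if "SCYLLA_HOME" in env:
        return [Path(env["SCYLLA_HOME"]) / "conf" / "scylla.yaml"]
    # 3) default location, then the new /etc/scylla fallback
    return [Path("conf/scylla.yaml"), Path("/etc/scylla/scylla.yaml")]


def find_scylla_yaml(cli_path=None, env=None, exists=Path.exists):
    # 'exists' is injectable so the lookup can be exercised without
    # touching the real filesystem.
    for path in candidate_paths(cli_path, env):
        if exists(path):
            return path
    return None
```

For example, with no parameter and no environment variables set, and only `/etc/scylla/scylla.yaml` present, the sketch returns the `/etc/scylla` path after `conf/scylla.yaml` is not found.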
Kefu Chai	752e6561fb	test/pylib: log if scylla exits with non-zero status code When destroying a test cluster, ScyllaCluster.stop() calls ScyllaServer.stop() for each running server. Previously, non-zero exit status codes from scylla servers were silently ignored during test teardown. This change modifies the logging behavior to print the exit status code when a scylla server exits with a non-zero status. This helps developers quickly identify potential issues or unexpected terminations during test runs. Differences in handling: - Before: Non-zero exit codes were not logged - After: Non-zero exit codes are printed, providing visibility into server termination errors This improvement aids in diagnosing intermittent test failures or unexpected server shutdowns during test execution. Refs #21742 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21934	2025-01-13 09:09:43 +03:00
Kefu Chai	41de3a17e1	api: move histogram data into future to avoid deep copying Previously, we created a vector<utils_json::histogram> and returned it by copying into a future. Since histogram is a JSON representation of ihistogram, it can be heavyweight, making the vector copy overhead significant. Now we move the vector into the returned future instead of copying it, eliminating the deep copy overhead. The APIs backed by this function are marked deprecated, so this performance improvement is not that important. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22004	2025-01-13 09:08:15 +03:00
Kefu Chai	fbca0a08f7	build: cmake: do not add absl::headers as a link directory In `0b0e661a85`, we brought abseil back as a submodule, and we added absl::headers as an interface library for importing abseil headers' include directory. And: ```console $ patchelf --print-rpath build/RelWithDebInfo/scylla /home/kefu/dev/scylla/idl/absl::headers ``` In this change, we remove `absl::headers` from `target_link_directories()` as it's an interface library that only provides header files, not linkable libraries. This fixes the incorrect inclusion of absl::headers in the rpath of the scylla executable. Additionally, remove abseil library dependencies from the idl target since none of the idl source files directly include abseil headers. After this change, ```console $ patchelf --print-rpath build/RelWithDebInfo/scylla ``` the output of `patchelf` is now empty. Fixes #22265 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22266	2025-01-13 09:05:43 +03:00
Kefu Chai	d815d7013c	sstables_loader: report progress with the unit of batch We restore a snapshot of table by streaming the sstables of the given snapshot of the table using `sstable_streamer::stream_sstable_mutations()` in batches. This function reads mutations from a set of sstables, and streams them to the target nodes. Due to the limit of this function, we are not able to track the progress in bytes. Previously, progress tracking used individual sstables as units, which caused inaccuracies with tablet-distributed tables, where: - An sstable spanning multiple tablets could be counted multiple times - Progress reporting could become misleading (e.g., showing "40" progress for a table with 10 sstables) This change introduces a more robust progress tracking method: - Use "batch" as the unit of progress instead of individual sstables. Each batch represents a tablet when restoring a table snapshot if the tablet being restored is distributed with tablets. When it comes to tables distributed with vnode, each batch represents an sstable. - Stream sstables for each tablet separately, handling both partially and fully contained sstables - Calculate progress based on the total number of sstables being streamed - Skip tablet IDs with no owned tokens For vnode-distributed tables, the number of "batches" directly corresponds to the number of sstables, ensuring: - Consistent progress reporting across different table distribution models - Simplified implementation - Accurate representation of restore progress The new approach provides a more reliable and uniform method of tracking restoration progress across different table distribution strategies. Also, Corrected the use of `_sstables.size()` in `sstable_streamer::stream_sstables()`. It addressed a review comment from Pavel that was inadvertently overlooked during previous rebasing the commit of `5ab4932f34`. 
Fixes scylladb/scylladb#21816 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21841	2025-01-13 09:04:35 +03:00
Dawid Mędrek	d1f960eee2	test/topology_custom: Add test for Scylla with disabled view building Before this commit, there doesn't seem to have been a test verifying that starting and shutting down Scylla behave correctly when the configuration option `view_building` is set to false. In these changes, we add one.	2025-01-13 00:41:27 +01:00
Dawid Mędrek	06ce976370	main, view: Pair view builder drain with its start In these changes, we pair draining the view builder with its start. To better understand what was done and why, let's first look at the situation before this commit and the context of it: (a) The following things happened in order: 1. The view builder would be constructed. 2. Right after that, a deferred lambda would be created to stop the view builder during shutdown. 3. group0_service would be started. 4. A deferred lambda stopping group0_service would be created right after that. 5. The view builder would be started. (b) Because the view builder depends on group0_client, it couldn't be started before starting group0_service. On the other hand, other services depend on the view builder, e.g. the stream manager. That makes changing the order of initialization a difficult problem, so we want to avoid doing that unless we're sure it's the right choice. (c) Since the view builder uses group0_client, there was a possibility of running into a segmentation fault issue in the following scenario: 1. A call to `view_builder::mark_view_build_success()` is issued. 2. We stop group0_service. 3. `view_builder::mark_view_build_success()` calls `announce_with_raft()`, which leads to a use-after-free because group0_service has already been destroyed. This very scenario took place in scylladb/scylladb#20772. Initially, we decided to solve the issue by initializing group0_service a bit earlier (scylladb/scylladb@7bad8378c7). Unfortunately, it led to other issues described in scylladb/scylladb#21534. We reverted that change in the previous commit. These changes are the second attempt to the problem where we want to solve it in a safer manner. The solution we came up with is to pair the start of the view builder with a deferred lambda that deinitializes it by calling `view_builder::drain()`. No other component of the system should be able to use the view builder anymore, so it's safe to do that. 
Furthermore, that pairing makes the analysis of initialization/deinitialization order much easier. We also solve the aforementioned use-after-free issue because the view builder itself will no longer attempt to use group0_client. Note that we still pair a deferred lambda calling `view_builder::stop()` with the construction of the view builder; that function will also call `view_builder::drain()`. Another notable thing is `view_builder::drain()` may be called earlier by `storage_service::do_drain()`. In other words, these changes cover the situation when Scylla runs into a problem when starting up. Fixes scylladb/scylladb#20772	2025-01-13 00:41:22 +01:00
Dawid Mędrek	a5715086a4	Revert "main,cql_test_env: start group0_service before view_builder" The patch solved a problem related to an initialization order (scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534. After moving the initialization of group0_service, it ended up being destroyed AFTER the CDC generation service would. Since CDC generations are accessed in `storage_service::topology_state_load()`: ``` for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) { rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id); co_await _cdc_gens.local().handle_cdc_generation(gen_id); ``` we started getting the following failure: ``` Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed. ``` We're reverting the patch to go back to a more stable version of Scylla and in the following commit, we'll solve the original issue in a more systematic way. This reverts commit `7bad8378c7`.	2025-01-12 18:13:56 +01:00
Avi Kivity	814942505f	Merge 'Introduce Encryption-at-Rest (EAR) for sstables and commitlog' from Calle Wilund Fixes https://github.com/scylladb/scylla-enterprise/issues/5016#issuecomment-2558464631 EAR - encryption at rest. Allows on-disk file encryption of sstables and commitlog data. Introduces OpenSSL based file level encrypted storage, managed via a set of providers ranging from local files to cloud KMS providers. For a more comprehensive explanation, see the included docs (or if possible, original source tree). Manual bulk merge of EAR feature from enterprise repo to main scylla repo. Breaks some features apart, but main EAR is still a humongous commit, because to separate this I would have to mess with code incrementally, adding time and risk. This PR includes the local file gen tool, tests and also p11 validation. Note: CI will not execute the full tests unless master CI is set to provide the same environment as the enterprise one. Not sure about the status of this ATM. Note: Includes code to compile against cryptsoft kmipc SDK, but not the SDK. If you happen to check out this tree in the scylla folder and configure, it will be linked against and KMIP functionality will be enabled, otherwise not. Closes scylladb/scylladb#22233 * github.com:scylladb/scylladb: docs: Add EAR docs main/build: Add p11-kit and initialize tools: Add local-file-key-generator tool tests: Add EAR tests tmpdir: shorten test tempdir path EAR: port the ear feature from enterprise cql_test_env: Add optional query timeout schema/migration_manager: Add schema validate sstables: add get_shared_components accessor config/config_file: Add exports and definitions of config_type_for<>	2025-01-12 16:10:46 +02:00
Kefu Chai	e71ac35426	mutation_writer,redis: do not include unused headers the changes porting enterprise features to oss brought some unused includes to the tree. so let's remove them. these unused includes were identified by clang-include-cleaner. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22246	2025-01-12 16:07:17 +02:00
Yaron Kaikov	6f30d26f2a	Update tools/cqlsh submodule * tools/cqlsh b09bc793...52c61306 (3): > cleanup: remove un-used Dockerfiles > .github/workflows/build-push.yml: update to newer macos images > cython: fix the usage of cython Closes scylladb/scylladb#22250	2025-01-12 16:06:30 +02:00
Jenkins Promoter	a7d8d21e86	Update pgo profiles - aarch64	2025-01-12 15:27:50 +02:00
Jenkins Promoter	b4ca9489c4	Update pgo profiles - x86_64	2025-01-12 15:05:40 +02:00
Benny Halevy	8d2ff8a915	utils: add disk_space_monitor Instantiated only on shard 0. Currently, only subscribe from unit test Manual unit test using loop mount was added. Note that the test requires sudo access and root access to /dev/loop, so it cannot run in rootless podman instance, and it'd fail with Permission denied. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21523	2025-01-12 14:51:15 +02:00
Piotr Smaron	288f9b2b15	Introduce LDAP role manager & saslauthd authenticator This PR extends authentication with 2 mechanisms: - a new role_manager subclass, which allows managing users via LDAP server, - a new authenticator, which delegates plaintext authentication to a running saslauthd daemon. The features have been ported from the enterprise repository with their test.py tests and the documentation as part of changing license to source available. Fixes: scylladb/scylla-enterprise#5000 Fixes: scylladb/scylla-enterprise#5001 Closes scylladb/scylladb#22030	2025-01-12 14:50:29 +02:00
Nadav Har'El	31c6a33666	Merge 'error_injection: replace boost::lexical_cast with std::from_chars' from Avi Kivity Replace boost with a standard facility; this reduces dependencies as lexical_cast depends on boost ranges. Since std::from_chars() is chatty, we introduce utils::from_chars_exactly() to trade some flexibility for conciseness. Small build time improvement, no backport needed. Closes scylladb/scylladb#22164 * github.com:scylladb/scylladb: error_injection: replace boost::lexical_cast with std::from_chars utils: introduce from_chars_exactly()	2025-01-12 14:38:54 +02:00
Lakshmi Narayanan Sreethar	d2ba45a01f	sstable_directory: reintroduce `get_shards_for_this_sstable()` Reintroduce `get_shards_for_this_sstable()` that was removed in commit ad375fbb. This will be used in the following patch to ensure that an sstable is loaded only once. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-01-10 23:32:58 +05:30
Aleksandra Martyniuk	1d46bdb1ad	test: boost: check resize_task_info in tablet_test.cc	2025-01-10 16:04:19 +01:00
Aleksandra Martyniuk	b11c21e901	test: add tests to check revoked resize virtual tasks The test is skipped in debug mode, because the preparation of revoke takes too long and wait request, which needs to be started before the preparation, hits timeout.	2025-01-10 16:04:11 +01:00
Aleksandra Martyniuk	50c9c0d898	test: add tests to check the list of resize virtual tasks	2025-01-10 10:03:09 +01:00
Aleksandra Martyniuk	2ed4bad752	test: add tests to check split and merge virtual tasks status	2025-01-10 10:03:09 +01:00
Aleksandra Martyniuk	48e0843767	test: test_tablet_tasks: generalize functions	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	062f155fd6	replica: service: add split virtual task's children offstrategy_compaction_task_executor and split_compaction_task_executor running as a part of the split become children of a split virtual task.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	7ef6900837	replica: service: pass parent info down to storage_group::split Pass task_info down to storage_group::split. In the following patches, it will be used to set the parent of offstrategy_compaction_task_executor and split_compaction_task_executor running as a part of the split. The task_info param will contain task info of a split virtual task.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	14dcaecc29	tasks: children of virtual tasks aren't internal by default Currently, streaming_task_impl is the only existing child of any virtual task. It overrides the is_internal definition so that it is non-internal even though it has a parent. This should apply to all children of all virtual tasks. Modify task_manager::task::impl::is_internal so that children of virtual tasks aren't internal by default.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	5a948d3fac	tasks: initialize shard in task_info ctor Initialize shard in task_info constructor. All current usages do not care about the shard of an empty task_info. In the following patches we may need that for setting info about virtual task parent.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	840bcdc158	service: extend tablet_virtual_task::abort Set resize tasks as non abortable.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	639470d256	service: return status_helper struct from tablet_virtual_task::get_status_helper	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	0c7bef6875	service: extend tablet_virtual_task::wait Extend tablet_virtual_task::wait to support resize tasks. To decide what is a state of a finished resize virtual task (done or failed), the tablet count is checked. The task state is set to done, if the tablet count before resize is different than after.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	24bbd161fd	tasks: add suspended task state Add suspended task state. It will be used for revoke resize requests.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	adf6b3f3ff	service: extend tablet_virtual_task::get_status Extend tablet_virtual_task::get_status to cover resize tasks.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	78215d64d1	service: extend tablet_virtual_task::contains Extend tablet_virtual_task::contains to check resize operations. Methods that do not support resize tasks return immediately if they are handling split or merge task.	2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk	0df64e18fb	service: extend tablet_virtual_task::get_stats Extend tablet_virtual_task::get_stats to list resize tasks.	2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk	a8d7f4d89a	service: add service::task_manager_module::get_nodes	2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk	3f6b932362	tasks: add task_manager::get_nodes Move an implementation of node_ops::task_manager_module::get_nodes to task_manager::get_nodes, so that it can be reused by other modules.	2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk	5dfac9290c	tasks: drop noexcept from module::get_nodes	2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk	18b829add8	replica: service: add resize_task_info static column to system.tablets Add resize_task_info static column to system.tablets. Set or delete resize_task_info value when the resize_decision is changed. Reflect the column content in tablet_map.	2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk	b6b4b767de	locator: extend tablet_task_info to cover resize tasks	2025-01-10 10:03:07 +01:00
Michael Litvak	2a8ff478f0	view_builder: register listener for new views before reading views When starting the view builder, we find all existing views in `calculate_shard_build_step` and then register a listener for new views. Between these steps we may yield and create a new view, then we miss initializing the view build step for the new view, and we won't start building it. To fix this we first register the listener and then read existing views, so a view can't be missed. Fixes scylladb/scylladb#20338 Closes scylladb/scylladb#22184	2025-01-09 13:18:28 +02:00
Calle Wilund	8e828f608d	docs: Add EAR docs Merge docs relating to EAR.	2025-01-09 10:40:47 +00:00
Calle Wilund	083f735366	main/build: Add p11-kit and initialize For p11 certification/validation	2025-01-09 10:40:47 +00:00
Calle Wilund	f901beec87	tools: Add local-file-key-generator tool For generating key files for local provider	2025-01-09 10:40:47 +00:00
Calle Wilund	c596ae6eb1	tests: Add EAR tests Adds the migrated EAR/encryption tests. Note: Until scylla CI is updated to provide all the proper ENV vars, some tests will not execute.	2025-01-09 10:40:39 +00:00
Calle Wilund	ee62b61c84	tmpdir: shorten test tempdir path To make certain python tests work in CI	2025-01-09 10:37:35 +00:00
Calle Wilund	723518c390	EAR: port the ear feature from enterprise Bulk transfer of EAR functionality. Includes all providers etc. Could maybe break up into smaller blocks, but once it gets down to the core of it, would require messing with code instead of just moving. So this is it. Note: KMIP support is disabled unless you happen to have the kmipc SDK in your scylla dir. Adds optional encryption of sstables and commitlog, using block level file encryption. Provides key sourcing from various sources, such as local files or popular KMS systems.	2025-01-09 10:37:26 +00:00
Avi Kivity	9ff6473691	error_injection: replace boost::lexical_cast with std::from_chars Replace boost with a standard facility; this reduces dependencies as lexical_cast depends on boost ranges. As a side effect the exception error message is improved.	2025-01-09 11:14:51 +02:00
Avi Kivity	224dc34089	utils: introduce from_chars_exactly() This is a replacement for boost::lexical_cast (but without its long dependency chain). It wraps std::from_chars(), providing a less flexible but also more concise interface.	2025-01-09 11:14:49 +02:00
Michał Chojnowski	1728f9c983	utils/dict_trainer: silence an ERROR log when raft is aborted during dict publication The dict publication routine might throw raft::request_aborted when the node is aborted. This doesn't deserve an ERROR log. Let's demote the log printed in this case from ERROR to DEBUG. Fixes scylladb/scylladb#22081 Closes scylladb/scylladb#22211	2025-01-08 17:55:05 +01:00
Kefu Chai	462a10c4f6	test.py: do not repeat "combined_tests" instead of repeating "combined_tests", let's define a variable for it. less repeating this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22185	2025-01-08 15:43:34 +02:00
Sergey Zolotukhin	2f1731c551	test: Include parent test name in `ScyllaClusterManager` log file names. Add the test file name to `ScyllaClusterManager` log file names alongside the test function name. This avoids race conditions when tests with the same function names are executed simultaneously. Fixes scylladb/scylladb#21807 Backport: not needed since this is a fix in the testing scripts. Closes scylladb/scylladb#22192	2025-01-08 15:42:31 +02:00
Calle Wilund	e734fc11ec	cql_test_env: Add optional query timeout Some tests need queries to actually fail.	2025-01-08 12:50:03 +00:00
Calle Wilund	511326882a	schema/migration_manager: Add schema validate Validates schema before announce. To ensure all extensions are happy.	2025-01-08 12:50:03 +00:00
Calle Wilund	9f06a0e3a3	sstables: add get_shared_components accessor To access the shared components.	2025-01-08 12:50:03 +00:00
Calle Wilund	7ed89266b3	config/config_file: Add exports and definitions of config_type_for<> Required for implementors. Other than config.cc.	2025-01-08 12:50:03 +00:00
Kefu Chai	d0a3311ced	locator: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22199	2025-01-08 14:26:48 +02:00
Kefu Chai	866520ff89	test.py: Defer Scylla executable check until test execution Move the Scylla executable existence check from PythonTestSuite's constructor to test execution time. This allows running unit tests that don't depend on the scylla executable without building it first. Previously, PythonTestSuite's constructor would fail if the Scylla executable was missing, preventing even unrelated unit tests from running. Now, only tests that actually require Scylla will fail if the executable is missing. Fixes scylladb/scylladb#22168 Refs scylladb/scylladb#19486 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22224	2025-01-08 14:25:50 +02:00
Michael Litvak	35316a40c8	service/storage_proxy: consider all replicas participating in write for MV backpressure replica writes are delayed according to the view update backlog in order to apply backpressure and reduce the rate of incoming base writes when the backlog is large, allowing slow replicas to catch up. previously the backlog calculation considered only the pending targets, excluding targets that replied successfully, probably due to confusion in the code. instead, we want to consider the backlog of all the targets participating in the write. Fixes scylladb/scylladb#21672 Closes scylladb/scylladb#21935	2025-01-08 12:03:26 +01:00
Botond Dénes	a2436f139f	docs/dev: review-checklist.md: expand the guide for good commit log Closes scylladb/scylladb#22214	2025-01-08 13:01:35 +02:00
Kefu Chai	f41b030fdd	repair: do not include unused header this unused include was identified by clang-include-cleaner. after auditing task_manager_module.hh, the report has been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22200	2025-01-08 12:58:35 +02:00
Avi Kivity	de8253b98a	types: explicitly instantiate map_type_impl::deserialize() The definition of the template is in a source translation unit, but there are also uses outside the translation unit. Without lto/pgo it worked due to the definition in the translation unit, but with lto/pgo we can presume the definition was inlined, so callers outside the translation unit did not have anything to link with. Fix by explicitly instantiating the template function. Closes scylladb/scylladb#22136	2025-01-08 11:52:11 +02:00
Benny Halevy	e6efaa3b73	Update seastar submodule * seastar 3133ecdd...a9bef537 (24): > file: add file_system_space > future: avoid inheriting from future payload type > treewide: include fmt/ostream.h for using fmt::print() > build: remove messages used for debugging > demos: Rename websocket demo to websocket_server demo > demos: Add a way to set port from cmd line in websocket demo > tls: Add optional builder + future-wait to cert reload callback + expose rebuild > rwlock: add try_hold_{read,write}_lock methods > json: add moving push to json_list > github: add a step to build "check-include-style" > build: add a target for checking include style > scheduling_group: use map for key configs instead of vector > scheduling_group: fix indentation > scheduling_group: fix race between scheduling group and key creation > http: Make request writing functions public > http: Expose connection_factory implementations > metrics: Use separate type for shared metadata > file: unexpected throw from inside noexcept > metrics: Internalize metric label sets > thread: optimize maybe_yield > reactor: fix crash in pending registration task after poller dtor > net: Fix ipv6 socket_address comparision > reactor, linux-aio: factor out get_smp_count() lambda > reactor, linux-aio: restore "available_aio" meaning after "reserve_iocbs" Fixed usage of seastar metric label sets due to: scylladb/seastar@733420d57 Merge 'metrics: Internalize metric label sets' from Stephan Dollberg Closes scylladb/scylladb#22076	2025-01-08 09:37:16 +02:00
Kefu Chai	23729beeb5	docs: remove "ScyllaDB Enterprise" labels remove the "ScyllaDB Enterprise" labels in document. because there is no need to differentiate ScyllaDB Enterprise from its OSS variant, let's stop adding the "ScyllaDB Enterprise" labels to enterprise-only features. this helps to reduce the confusion. as we are still in the process of porting the enterprise features to this repo, this change does not fix scylladb/scylladb#22175. we will review the document again when completing the migration. we also take this opportunity to stop referencing "Enterprise" in the changed paragraph. Refs scylladb/scylladb#22175 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22177	2025-01-08 09:02:52 +02:00
Kefu Chai	e51b2075da	docs/kb: correct referenced git sha1 and version number in `047ce136`, we cherry-picked the change adding garbage-collection-ics.rst to the document. but it was still referencing the git sha1 and version number in enterprise. this change updates kb/garbage-collection-ics.rst, so that it * references the git commit sha1 in this repo * do not reference the version introducing this feature, as per Anna Stuchlik > As a rule, we should avoid documenting when something was > introduced or set as a default because our documentation > was versioned. Per-version information should be listed in > the release notes. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22195	2025-01-08 07:08:15 +02:00
Michał Chojnowski	9f639b176f	db/config: increase the default value of internode_compression_zstd_min_message_size from 0 to 1024 Usually, the smaller the message, the higher the CPU cost per each network byte saved by compression, so it often makes sense to reserve heavier compression for bigger messages (where it can make the biggest impact for a given CPU budget) and use lighter compression for smaller messages. There is a knob -- internode_compression_zstd_min_message_size -- which excludes RPC messages below a certain size from being compressed with zstd. We arbitrarily set its default to 0 bytes before. Now we want to arbitrarily set it to 1024 bytes. This is based purely on intuition and isn't backed by any solid data. Fixes scylladb/scylla-enterprise#4731 Closes scylladb/scylla-enterprise#4990 Closes scylladb/scylladb#22204	2025-01-07 18:14:01 +02:00
Wojciech Mitros	d04f376227	mv: add an experimental feature for creating views using tablets We still have a number of issues to be solved for views with tablets. Until they are fixed, we should prevent users from creating them, and use the vnode-based views instead. This patch prepares the feature for enabling views with tablets. The feature is disabled by default, but currently it has no effect. After all tests are adjusted to use the feature, we should depend on the feature for deciding whether we can create materialized views in tablet-enabled keyspaces. The unit tests are adjusted to enable this feature explicitly, and it's also added to the scylla sstable tool config - this tool treats all tables as if they were tablet-based (surprisingly, with SimpleStrategy), so for it to work on views, the new feature must be enabled. Refs scylladb/scylladb#21832 Closes scylladb/scylladb#21833	2025-01-07 15:52:36 +01:00
Emil Maskovsky	115005d863	raft: refactor the voters api to allow enabling voters The raft voters api implementation only allowed to make a node to be a non-voter, but for the "limited voters" feature we need to also have the option to make the node a voter (from within the topology coordinator). Modifying the api to allow both adding and removing voters. This in particular tries to simplify the API by not having to add another set of new functions to make a voter, but having a single setter that allows to modify the node configuration to either become a voter or a non-voter. Fixes: scylladb/scylladb#21914 Refs: scylladb/scylladb#18793 Closes scylladb/scylladb#21899	2025-01-07 15:25:50 +01:00
Asias He	d719f423e5	config: Enable enable_small_table_optimization_for_rbno by default Since the problematic dtests are with the enable_small_table_optimization_for_rbno turn off now, we can enable the flag by default. https://github.com/scylladb/scylla-dtest/pull/5383 Refs: #19131 Closes scylladb/scylladb#21861	2025-01-07 16:20:36 +02:00
Asias He	935dcd69fa	repair: Remove repair_task_info only when repair is finished In case of error, repair will be moved into the end_repair stage. We should not remove repair_task_info in this case because the repair task requested by the user is not finished yet. To fix, we should remove repair_task_info at the end of repair stage. Tests are added to ensure failed repair is not reported as finished. Closes scylladb/scylladb#21973	2025-01-07 16:19:40 +02:00
Avi Kivity	748d30a34d	tools: toolchain: simplify non-emulated build procedure Avoid using temporary names and instead treat the final image tag as a temporary. The new procedure is more or less remote-final := local-x86_64 local-aarch64 += remote-final remote-final := local-aarch64 (which now contains the x86_64 image too) Closes scylladb/scylladb#21981	2025-01-07 16:17:29 +02:00
Asias He	baaee28c07	storage_service: Add tablet migration log So that both mutation and file streaming will have the same log for tablet streaming which simplifies the dtest checking. Closes scylladb/scylladb#22176	2025-01-07 15:16:37 +01:00
Emil Maskovsky	2ac9ed2073	raft: test the limited voters feature Test the limited voters feature by creating a cluster with 3 DCs, one of them disproportionately larger than the others. The raft majority should not be lost in case the large DC goes down. Fixes: scylladb/scylla#21915 Refs: scylladb/scylla#18793 Closes scylladb/scylladb#21901	2025-01-07 15:09:49 +01:00
Michael Litvak	0617564123	db/commitlog: make the commit log hard limit mandatory mark the config parameter --commitlog-use-hard-size-limit as deprecated so the default 'true' is always used, making the hard limit mandatory. Fixes scylladb/scylladb#16471 Closes scylladb/scylladb#21804	2025-01-07 15:03:56 +02:00
Anna Stuchlik	8d824a564f	doc: add troubleshooting removal with --autoremove-ubuntu This commit adds a troubleshooting article on removing ScyllaDB with the --autoremove option. Fixes https://github.com/scylladb/scylladb/issues/21408 Closes scylladb/scylladb#21697	2025-01-07 13:35:13 +01:00
Botond Dénes	b3f8c4faa7	Merge 'node_ops: filter topology_requests entries shown by node_ops_virtual_task' from Aleksandra Martyniuk node_ops_virtual_task does not filter the entries of system.topology_request and so it creates statuses of operations that aren't node ops. Filter the entries used by node_ops_virtual_task. With this change, the status of a bootstrap of the first node will not be visible. Fixes: https://github.com/scylladb/scylladb/issues/22008. Needs backport to 6.2 that introduced node_ops_virtual_task Closes scylladb/scylladb#22009 * github.com:scylladb/scylladb: test: truncate the table before node ops task checks node_ops: rename a method that get node ops entries node_ops: filter topology_requests entries	2025-01-07 14:17:01 +02:00
Dani Tweig	d984f27b23	Create urgent_issue_reminder.yml Closes scylladb/scylladb#22042	2025-01-07 14:16:17 +02:00
Kefu Chai	353b522ca0	treewide: migrate from boost::adaptors::reversed to std::views::reverse now that we are allowed to use C++23. we now have the luxury of using `std::views::reverse`. - replace `boost::adaptors::transformed` with `std::views::transform` - remove unused `#include <boost/range/adaptor/reversed.hpp>` this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-07 13:22:00 +02:00
Kefu Chai	f7fd55146d	compaction: do not include unused headers these unused includes are identified by clang-include-cleaner. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22188	2025-01-07 13:18:31 +02:00
Yaron Kaikov	b74565e83f	dist/common/scripts/scylla_raid_setup: reduce XFS metadata overhead The block size of 1k is significantly increasing metadata overhead with xfs since it reserves space upfront for btree expansion. With CRC disabled, this reservation doesn't happen. Smaller btree blocks reduce the fanout factor, increasing btree height and the reservation size. So block size implies a trade-off between write amplification and metadata size. Bigger blocks, smaller metadata, more write ampl. Smaller blocks, more metadata, and less write ampl. Let's disable both `rmapbt` and `reflink` since we replicate data, and we can afford to rebuild a replica on local corruption. Fixes: https://github.com/scylladb/scylladb/issues/22028 Closes scylladb/scylladb#22072	2025-01-07 13:18:21 +02:00
Botond Dénes	69150f0680	Merge 'Fix edge case issues related to tablet draining ' from Tomasz Grabiec Main problem: If we're draining the last node in a DC, we won't have a chance to evaluate candidates and notice that constraints cannot be satisfied (N < RF). Draining will succeed and node will be removed with replicas still present on that node. This will cause later draining in the same DC to fail when we will have 2 replicas which need relocation for a given tablet. The expected behavior is for draining to fail, because we cannot keep the RF in the DC. This is consistent, for example, with what happens when removing a node in a 2-node cluster with RF=2. Fixes #21826 Secondary problem: We allowed tablet_draining transition to be exited with undrained nodes, leaving replicas on nodes in the "left" state. Third problem: We removed DOWN nodes from the candidate node set, even when draining. This is not safe because it may lead to overload. This also makes the "main problem" more likely by extending it to the scenario when the DC is DOWN. The overload part is not a problem in practice currently, since migrations will block on global topology barrier if there are DOWN nodes. Closes scylladb/scylladb#21928 * github.com:scylladb/scylladb: tablets: load_balancer: Fail when draining with no candidate nodes tablets: load_balancer: Ignore skip_list when draining tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained	2025-01-07 13:04:00 +02:00
Botond Dénes	173fad296a	tools/schema_loader.cc: remove duplicate include of short_streams.hh Closes scylladb/scylladb#21982	2025-01-07 13:03:17 +02:00
David Garcia	66a5e7f672	docs: update Sphinx configuration for unified repository publishing This change is related to the unification of enterprise and open-source repositories. The Sphinx configuration is updated to build documentation either for `docs.scylladb.com/manual` or `opensource.docs.scylladb.com`, depending on the flag passed to Sphinx. By default, it will build docs for `docs.scylladb.com/manual`. If the `opensource` flag is passed, it will build docs for `opensource.docs.scylladb.com`, with a different set of versions. This change will prepare the configuration to publish to `docs.scylladb.com/manual` while allowing the option to keep publishing and editing docs with a different multiversion configuration. Note that this change will continue publishing docs to `opensource.docs.scylladb.com` for now since the `opensource` flag is being passed in the `gh-pages.yml` branch. chore: remove comment chore: update project name Closes scylladb/scylladb#22089	2025-01-07 12:54:51 +02:00
Kefu Chai	e4463b11af	treewide: replace boost::algorithm::join() with fmt::join() Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve performance and reduce dependency on Boost. `fmt::join()` allows direct formatting of ranges and tuples with custom separators without creating intermediate strings. When formatting comma-separated values into another string, fmt::join() avoids the overhead of temporary string creation that `boost::algorithm::join()` requires. This change also helps streamline our dependencies by leveraging the existing fmt library instead of Boost.Algorithm. To avoid the ambiguity, some caller sites were updated to call `seastar::format()` explicitly. See also - boost::algorithm::join(): https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp - fmt::join(): https://fmt.dev/11.0/api/#ranges-api Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22082	2025-01-07 12:45:05 +02:00
Aleksandra Martyniuk	a91e03710a	repair: check tasks local to given shard Currently task_manager_module::is_aborted checks the tasks local to caller's shard on a given shard. Fix the method to check the task map local to the given shard. Fixes: #22156. Closes scylladb/scylladb#22161	2025-01-06 21:53:54 +02:00
Kefu Chai	d3f3e2a6c8	.github: add more subdirectories to CLEANER_DIR in order to prevent future inclusion of unused headers, let's include - mutation_writer - node_ops - redis - replica subdirectories to CLEANER_DIR, so that this workflow can identify the regressions in future. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22050	2025-01-06 21:28:39 +02:00
Avi Kivity	5653d13d48	Merge 'Clean up test/alternator mistakes that service levels introduced' from Nadav Har'El The recent pull request https://github.com/scylladb/scylladb/pull/22031 introduced some regressions into the test/alternator framework. For a long time now, tests can create their own CQL roles for testing role-based features. But the new service levels test changed the "run" script and test.py's "suite.yaml" to create a new role and service level just for one test. This is not only ugly (the test code is now split to two places) and unnecessary, this setup also means that you can't run this test against an already-running copy of Scylla which wasn't prepared with the "right" role and service level. Even worse - the code that was added test/alternator/run was plain wrong - it used an outdated keyspace name (the code in suite.yaml was fine). So in this patch I remove that extra run and suite.yaml code, and replace it by code inside the service level test to create the role and service level that it wants to test rather than assume it already exists. While at it, I also removed a lot of duplicate and unnecessary code from this test. After this patch, test/alternator/run returns to work correctly, after #22031 broke it. This patch fixes a recent testing-framework regression, so doesn't need to be backported (unless that regression is backported). Fixes #22047. Closes scylladb/scylladb#22172 * github.com:scylladb/scylladb: test/alternator: fix mistakes introduced with test_service_levels.py test/alternator: move "cql" fixture to test/alternator/conftest.py	2025-01-06 17:44:25 +02:00
Anna Stuchlik	047ce13641	doc: add a new KB article about tombstone garbage collection in ICS Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22174	2025-01-06 16:48:50 +02:00
Kefu Chai	8873a4e1aa	test.py: pass "count" to re.sub() with kwarg since Python 3.13, passing count to `re.sub()` as positional argument has been deprecated. and when running `test.py` with Python 3.13, we have following warning: ``` /home/kefu/dev/scylladb/./test.py:1540: DeprecationWarning: 'count' is passed as positional argument args.modes = re.sub(r'.* List configured modes\n(.*)\n', r'\1', ``` see also https://github.com/python/cpython/issues/56166 in order to silence this distracting warning, let's pass `count` using kwarg. this change was created in the same spirit of `c3be4a36af`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22085	2025-01-06 16:35:38 +02:00
Avi Kivity	4632e217e3	cql3: grammar: simplify unaliasedSelector production The return variable s only gets a value by assignment from the temporary tmp. Make tmp the return value instead. Closes scylladb/scylladb#22151	2025-01-06 13:06:12 +02:00
Kefu Chai	9396c2ee6c	api: include "smaller" header Previously, `api/service_levels.hh` includes `api/api.hh` for accessing symbols like `api/http_context`. but these symbols are already available in a "smaller" header -- `api/api_init.hh`. so, in order to improve the build efficiency, let's include smaller headers in favor of "larger" ones. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22178	2025-01-06 13:04:33 +02:00
Amnon Heiman	7390116620	alternator/test_returnconsumedcapacity.py: update item This patch adds tests for return consumed capacity for update_item. The tests cover: a simple update for a small object, a missing item, an update with a very large attribute (where the attribute itself is more than 1KB), and an update of a big item that uses read-before-write.	2025-01-06 09:55:17 +02:00
Nadav Har'El	fc22d5214f	Merge 'test.py: check for existence of combined test with correct path' from Kefu Chai test.py: Only check existence of Scylla executable Previously, we had inconsistent behavior around missing executables: - `561e88f0` added early failure if any executable was missing - `8b7a5ca8` added a partial skip for combined_test, but didn't properly handle build paths and artifacts This change: 1. Moves executable existence check to PythonTestSuite class 2. Only adds combined_test suite when the executable exists 3. Eliminates redundant os.access() checks 4. Corrects the path to combined_test when checking for its existence This allows running tests with a partial build while properly handling missing executables, particularly for the combined_test suite. Fixes scylladb/scylladb#22086 --- no need to backport, because the offending commit (`8b7a5ca88d`) is not included by any LTS branches yet. Closes scylladb/scylladb#22163 * github.com:scylladb/scylladb: test.py: Fix path checking for combined_test executable test.py: Throw only if scylla executable is not found	2025-01-06 09:21:01 +02:00
Nadav Har'El	e919794db8	test/alternator: fix mistakes introduced with test_service_levels.py This patch undoes multiple mistakes done when introducing the test for service levels in pull request #22031: 1. The PR introduced in test/alternator/run and test/alternator/suite.yaml a permanent role and service level that the service-level test is supposed to use. This was a mistake - the test can create the service level for its own use, using CQL, it does not need to assume such a service level already exists. It's important to fix this to allow the service level test to run against an installation of Scylla not set up by our own scripts. Moreover, while the code in suite.yaml was correct, the code in "run" was incorrect (used an outdated keyspace name). This patch removes that incorrect code. 2. The PR introduced a duplicate "cql" fixture, copied verbatim from test_cql_rbac.py (including a comment that was correct only in the latter file :-)). Let's de-duplicate it, using the fixture that I moved to conftest.py in the previous patch. 3. The PR used temporary_grant(). This needlessly complicated the test and added even more duplicate code, and this patch removes all that stuff. This test is about service levels, not RBAC and "grant". This test should just use a superuser role that has the permissions to do everything, and doesn't need to be granted specific permissions. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-01-05 19:40:14 +02:00
Nadav Har'El	879c0a3bd6	test/alternator: move "cql" fixture to test/alternator/conftest.py Most Alternator test use only the DynamoDB API, not CQL. Tests in test_cql_rbac.py did need CQL to set up roles and RBAC, so this file introduced a "cql" fixture to make CQL requests. A recently-introduced test/alternator/test_service_levels.py also needs access to CQL - it currently uses it for misguided reasons but the next patch will need it for creating a role and a service level. So instead of duplicating this fixture, let's move this fixture into test/alternator/conftest.py that all Alternator tests can share. The next patch will clean up this duplication in test_service_levels.py and the other mistakes it introduced. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-01-05 19:33:55 +02:00
Kefu Chai	569f8e9246	treewide: fix misspellings these misspellings were identified by codespell. let's fix them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22154	2025-01-05 16:13:09 +02:00
Raphael S. Carvalho	c973254362	Introduce incremental compaction strategy (ICS) ICS is a compaction strategy that inherits size tiered properties -- therefore it's write optimized too -- but fixes its space overhead of 100% due to input files being only released on completion. That's achieved with the concept of sstable run (similar in concept to LCS levels) which breaks a large sstable into fixed-size chunks (1G by default), known as run fragments. ICS picks similar-sized runs for compaction, and fragments of those runs can be released incrementally as they're compacted, reducing the space overhead to about (number_of_input_runs * 1G). This allows user to increase storage density of nodes (from 50% to ~80%), reducing the cost of ownership. NOTE: test_system_schema_version_is_stable adjusted to account for batchlog using IncrementalCompactionStrategy contains: compaction/: added incremental_compaction_strategy.cc (.hh), incremental_backlog_tracker.cc (.hh) compaction/CMakeLists.txt: include ICS cc files configure.py: changes for ICS files, includes test db/legacy_schema_migrator.cc / db/schema_tables.cc: fallback to ICS when strategy is not supported db/system_keyspace: pick ICS for some system tables schema/schema.hh: ICS becomes default test/boost: Add incremental_compaction_test.cc test/boost/sstable_compaction_test.cc: ICS related changes test/cqlpy/test_compaction_strategy_validation.py: ICS related changes docs/architecture/compaction/compaction-strategies.rst: changes to ICS section docs/cql/compaction.rst: changes to ICS section docs/cql/ddl.rst: adds reference to ICS options docs/getting-started/system-requirements.rst: updates sentence mentioning ICS docs/kb/compaction.rst: changes to ICS section docs/kb/garbage-collection-ics.rst: add file docs/kb/index.rst: add reference to <garbage-collection-ics> docs/operating-scylla/procedures/tips/production-readiness.rst: add ICS section some relevant commits throughout the ICS history: commit 
434b97699b39c570d0d849d372bf64f418e5c692 Merge: 105586f747 30250749b8 Author: Paweł Dziepak <pdziepak@scylladb.com> Date: Tue Mar 12 12:14:23 2019 +0000 Merge "Introduce Incremental Compaction Strategy (ICS)" from Raphael " Introduce new compaction strategy which is essentially like size tiered but will work with the existing incremental compaction. Thus incremental compaction strategy. It works like size tiered, but each element composing a tier is a sstable run, meaning that the compaction strategy will look for N similar-sized sstable runs to compact, not just individual sstables. Parameters: * "sstable_size_in_mb": defines the maximum sstable (fragment) size composing a sstable run, which impacts directly the disk space requirement which is improved with incremental compaction. The lower the value the lower the space requirement for compaction because fragments involved will be released more frequently. * all others available in size tiered compaction strategy HOWTO ===== To change an existing table to use it, do: ALTER TABLE mykeyspace.mytable WITH compaction = {'class' : 'IncrementalCompactionStrategy'}; Set fragment size: ALTER TABLE mykeyspace.mytable WITH compaction = {'class' : 'IncrementalCompactionStrategy', 'sstable_size_in_mb' : 1000 } " commit 94ef3cd29a196bedbbeb8707e20fe78a197f30a1 Merge: dca89ce7a5 e08ef3e1a3 Author: Avi Kivity <avi@scylladb.com> Date: Tue Sep 8 11:31:52 2020 +0300 Merge "Add feature to limit space amplification in Incremental Compaction" from Raphael " A new option, space_amplification_goal (SAG), is being added to ICS. This option will allow ICS user to set a goal on the space amplification (SA). It's not supposed to be an upper bound on the space amplification, but rather, a goal. This new option will be disabled by default as it doesn't benefit write-only (no overwrites) workloads and could hurt severely the write performance. 
The strategy is free to delay triggering this new behavior, in order to increase overall compaction efficiency. The graph below shows how this feature works in practice for different values of space_amplification_goal: https://user-images.githubusercontent.com/1409139/89347544-60b7b980-d681-11ea-87ab-e2fdc3ecb9f0.png When strategy finds space amplification crossed space_amplification_goal, it will work on reducing the SA by doing a cross-tier compaction on the two largest tiers. This feature works only on the two largest tiers, because taking into account others, could hurt the compaction efficiency which is based on the fact that the more similar-sized sstables are compacted together the higher the compaction efficiency will be. With SAG enabled, min_threshold only plays an important role on the smallest tiers, given that the second-largest tier could be compacted into the largest tier for a space_amplification_goal value < 2. By making the options space_amplification_goal and min_threshold independent, user will be able to tune write amplification and space amplification, based on the needs. The lower the space_amplification_goal the higher the write amplification, but by increasing the min threshold, the write amplification can be decreased to a desired amount. " commit 7d90911c5fb3fa891ad64a62147c3a6ca26d61b1 Author: Raphael S. Carvalho <raphaelsc@scylladb.com> Date: Sat Oct 16 13:41:46 2021 -0300 compaction: ICS: Add garbage collection Today, ICS lacks an approach to persist expired tombstones in a timely manner, which is a problem because accumulation of tombstones are known to affecting latency considerably. For an expired tombstone to be purged, it has to reach the top of the LSM tree and hope that older overlapping data wasn't introduced at the bottom. The condition are there and must be satisfied to avoid data resurrection. 
STCS, today, has an inefficient garbage collection approach because it only picks a single sstable, which satisfies the tombstone density threshold and file staleness. That's a problem because overlapping data either on same tier or smaller tiers will prevent tombstones from being purged. Also, nothing is done to push the tombstones to the top of the tree, for the conditions to be eventually satisfied. Due to incremental compaction, ICS can more easily have an efficient GC by doing cross-tier compaction of relevant tiers. The trigger will be file staleness and tombstone density, whose threshold values can be configured by tombstone_compaction_interval and tombstone_threshold, respectively. If ICS finds a tier which meets both conditions, then that tier and the larger[1] and closest-in-size[2] tier will be compacted together. [1]: A larger tier is picked because we want tombstones to eventually reach the top of the tree. [2]: It also has to be the closest-in-size tier as the smaller the size difference the higher the efficiency of the compaction. We want to minimize write amplification as much as possible. The staleness condition is there to prevent the same file from being picked over and over again in a short interval. With this approach, ICS will be continuously working to purge garbage while not hurting overall efficiency on a steady state, as same-tier compactions are prioritized. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211016164146.38010-1-raphaelsc@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22063	2025-01-04 15:43:52 +02:00
Kefu Chai	220cafe7c4	test.py: Fix path checking for combined_test executable Previously in `8b7a5ca88d`, we checked for combined_test existence without the "build" component in the path. This caused the test suite to never find the executable, preventing the test cases' cache from being populated. Changes: 1. Use path_to() to check executable existence, which: - Includes the "build" component in path - Handles both CMake and configure.py build paths 2. Move existence check out of _generate_cache() for clarity This ensures combined_test and its included tests are properly discovered and run. Fixes scylladb/scylladb#22086 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-04 06:11:21 +08:00
Kefu Chai	9d0f27e7c1	test.py: Throw only if scylla executable is not found Previously, we had inconsistent behavior around missing executables: - `561e88f0` added early failure if any executable was missing - `8b7a5ca8` added a partial skip for combined_test, but didn't properly handle build paths and artifacts This change: 1. Moves executable existence check to PythonTestSuite class 3. Eliminates redundant os.access() checks This allows running tests with a partial build while properly handling missing executables, particularly for the combined_test suite. In a succeeding change, we will correct the check for combined_tests. Refs scylladb/scylladb#19489 Refs scylladb/scylladb#22086 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-04 06:11:21 +08:00
Tomasz Grabiec	4c89e62470	Merge 'Phased barrier improvements' from Benny Halevy - utils: phased_barrier: advance_and_await: allocate new gate only when needed - utils: phased_barrier: add close() method - and use in existing services * Improvement. No backport needed Closes scylladb/scylladb#22018 * github.com:scylladb/scylladb: utils: phased_barrier: add close() method utils: phased_barrier: advance_and_await: allocate new gate only when needed	2025-01-03 18:51:23 +01:00
Sergey Zolotukhin	155480595f	storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function. The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the `schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve` initialized with the same variable in `abstract_read_executor` is redundant.	2025-01-03 10:04:13 +01:00
Sergey Zolotukhin	39785c6f4e	storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap. This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values. Fixes scylladb/scylladb#19101	2025-01-03 09:53:02 +01:00
Sergey Zolotukhin	e577f1d141	test: Add test case for checking read repair diff calculation when having conflicting keys. The test updates two rows with keys that result in a Murmur3 hash collision, which is used to generate Scylla tokens. These tokens are involved in read repair diff calculations. Due to the identical token values, a hash map key collision occurs. Consequently, an incorrect value from the second row (with a different primary key) is then sent for writing as 'repaired', causing data corruption.	2025-01-03 09:53:02 +01:00
Avi Kivity	202f16e799	Merge 'Introduce workload prioritization for service levels' from Piotr Dulikowski This series introduces workload prioritization: an extension of the service levels feature which allows specifying "shares" per service level. The number of shares determines the priority of the user which has this service level attached (if multiple are attached then the one with the lowest shares wins). Different service levels will be isolated in the following way: - Each service level gets its own scheduling group with the number of shares (corresponding to the service level's number of shares), which controls the priority of the CPU and I/O used for user operations running on that service level. - Each service level gets two reader concurrency semaphores, one for user reads and the other for read-before-write done for view updates. - Each service level gets its own TCP connections for RPC to prevent priority inversion issues. Because of the mandatory use of scheduling groups, which are a globally limited resource, the number of service levels is now limited to 7 user created service levels + 1 created by default that cannot be removed. This feature has been previously only available in ScyllaDB Enterprise but has been made available for the source available ScyllaDB. The series was created by comparing the master branch with source-available-workbranch / enterprise branch and taking the workload prioritization related parts from the diff, then molding the resulting diff into a proper series. Some very minor changes were made such as fixing whitespace, removing unused or unnecessary code, adding some boilerplate (in api/) which was missing, but otherwise no major changes have been made. No backport is required. 
Closes scylladb/scylladb#22031 * github.com:scylladb/scylladb: tracing: record scheduling group in trace event record qos: un-shared-from-this standard_service_level_distributed_data_accessor alternator: execute under scheduling group for service level test.py: support multiple commands in prepare_cql in suite.yml docs: add documentation for workload prioritization docs/dev: describe workload prioritization features in service_levels test/auth_cluster: test workload prioritization in service level tests cqlpy/test_service_levels: add workload prioritization tests api: introduce service levels specific API api/cql_server_test: add information about scheduling group db/virtual_tables: add scheduling group column to system.clients test/boost: update service_level_controller_test for workload prio qos: include number of shares in DESCRIBE cql3/statements: update SL statements for workload prioritization transport/server: use scheduling group assigned to current user messaging_service: use separate set of connections per service levels replica/database: add reader concurrency semaphore groups qos: manage and assign scheduling groups to service levels qos: use the shares field in service level reads/writes qos: add shares to service_level_options qos: explicitly specify columns when querying service level tables db/system_distributed_keyspace: add shares column and upgrade code db/system_keyspace: adjust SL schema for workload prioritization gms: introduce WORKLOAD_PRIORITIZATION cluster feature build: increase the max number of scheduling groups qos: return correct error code when SL does not exist	2025-01-02 20:05:36 +02:00
Kefu Chai	0ea8cd2bb8	test/pylib/minio_server: use error level for fatal errors Previously fatal errors like missing Minio executable were logged at INFO level, which could be filtered out by log settings. Switch to ERROR level to ensure these critical issues are always visible to developers. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22084	2025-01-02 20:03:55 +02:00
Gleb Natapov	245483f1bc	topology coordinator: reject replace request if topology does not match Currently this should not happen because the gossiper shadow round does a similar check, but we want to eventually drop from the gossiper the states that propagate through raft.	2025-01-02 18:44:19 +02:00
Gleb Natapov	7e3a196734	gossiper: fix the logic of shadow_round parameter Currently the logic is mirrored: shadow_round is true in a non-shadow round. Fix it by flipping all the logic.	2025-01-02 18:44:19 +02:00
Gleb Natapov	2736a3e152	storage_service: do not add endpoint to the gossiper during topology loading. As the removed comment says, this was done because storage_service::join_cluster did not load gossiper endpoints, but now it does.	2025-01-02 18:44:19 +02:00
Gleb Natapov	4fee8e0e09	storage_service: load peers into gossiper on boot in raft topology mode Gossiper manages address map now, so load peers table into the gossiper on reboot to be able to map ids to ips as early as possible.	2025-01-02 18:44:19 +02:00
Gleb Natapov	acbc667d3e	storage_service: set raft topology change mode before using it in join_cluster ss::join_cluster calls raft_topology_change_enabled() before the mode is initialized below in the same function. Fix it by changing the order.	2025-01-02 18:44:19 +02:00
Gleb Natapov	491b7232de	locator: drop inet_address usage to figure out per dc/rack replication This allows the replication map to be calculated correctly even without knowing the IPs of the nodes.	2025-01-02 18:44:19 +02:00
Botond Dénes	7d42b80228	service/storage_proxy: data_read_resolver::resolve(): remove unneeded maybe_yield() We already have a yield in the loop via apply_gently(), so the maybe_yield() is superfluous; remove it. Follow-up to https://github.com/scylladb/scylladb/pull/21884 Closes scylladb/scylladb#21984	2025-01-02 16:13:29 +01:00
Kefu Chai	de42dce4c4	pgo: use java-11 when running cassandra-stress we updated tools/java/build.xml recently to only build for java-11. so if - the `java` executable in `$PATH` points to a java which is neither java-8 nor java-11. - java-8 is installed java-8 is used to execute the cassandra-stress tool. and we would have following failure: ``` Error: A JNI error has occurred, please check your installation and try again Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/cassandra/stress/Stress has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recogniz es class file versions up to 52.0 at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:756) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:473) at java.net.URLClassLoader.access$100(URLClassLoader.java:74) at java.net.URLClassLoader$1.run(URLClassLoader.java:369) at java.net.URLClassLoader$1.run(URLClassLoader.java:363) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:362) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) at java.lang.ClassLoader.loadClass(ClassLoader.java:351) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:621) ``` in order to be compatible with the bytecode targeting java-11, let's run cassandra-stress with java-11. we do not need to support java-8, because the new tools/java is now building cassandra-stress targeting java-11 jre. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22142	2025-01-02 16:56:29 +02:00
Artsiom Mishuta	174199610b	test.py: add more log info if the server is broken The server_broken_reason attribute was added to the server to store the raw information about why the server was broken. Additional information was added to the error messages in the "server broken" case. Fixes: #21630 Closes scylladb/scylladb#22074	2025-01-02 16:54:55 +02:00
Kefu Chai	233e3969c4	utils: correct misspellings these misspellings were identified by codespell. let's fix them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22143	2025-01-02 16:47:57 +02:00
Avi Kivity	1ce373d80b	schema: deinline some speculative_retry methods These string conversion functions are not in any fast path. Deinlining them moves a <boost/lexical_cast.hpp> include out of a common header file. Some files accessed boost::iterator_range via lexical_cast.hpp, so they gain a new dependency. Closes scylladb/scylladb#21950	2025-01-02 12:28:33 +01:00
Avi Kivity	051c310f02	tracing: record scheduling group in trace event record We have a "thread" field (unfortunately not yet displayed in cqlsh, but visible in the table) that records the shard on which a particular event was recorded. Record the scheduling group as well, as this can be useful to understand where the query came from. (cherry picked from commit 3c03b5f66376dca230868e54148ad1c6a1ad0ee2)	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	07fdf9d21f	qos: un-shared-from-this standard_service_level_distributed_data_accessor Apparently, it is not needed for standard_service_level_distributed_data_accessor to derive from enable_shared_from_this.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	b23bc3a5d5	alternator: execute under scheduling group for service level Now, the Alternator API requests are executed under the correct scheduling group of the service level assigned to the currently logged in user.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	67b11e846a	test.py: support multiple commands in prepare_cql in suite.yml This will be needed for alternator tests introduced in the next commit, which will have to execute multiple CQL operations during preparation.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	07b162fb5b	docs: add documentation for workload prioritization The doc pages were slightly adjusted during migration not to mention Scylla Enterprise and to fix some whitespace issues.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	241e710c19	docs/dev: describe workload prioritization features in service_levels The concept of shares, and some helper HTTP APIs, are now described in the developer documentation for service levels.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	473bb44722	test/auth_cluster: test workload prioritization in service level tests Update `test_connections_parameters_auto_update` to also check that the scheduling group of given connections is appropriately changed when a different service level is assigned to the user that the connection uses for authentication. Apart from that, more tests are added: - Check for the logic that forbids setting shares for a service level until all nodes in the cluster are upgraded - Test for handling the case when there are more scheduling groups than it is allowed (it might happen after upgrade from a non-workload-prio version) - Regression test for a bug where less scheduling groups could have been created than allowed due to some metrics not being renamed on scheduling group name change.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	29b153c9e7	cqlpy/test_service_levels: add workload prioritization tests Adjust existing cqlpy tests and add more in order to test the workload prioritization feature: - The DESCRIBE test is updated to check that generated statements contain information about shares - Two tests for shares in the LIST EFFECTIVE SERVICE LEVEL statement - Regression test which checks that we can create as many service levels as promised in the documentation (currently 7), but no more - Test which checks that NULL shares in the service levels table are treated as the default 1000 shares	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	49f5fc0e70	api: introduce service levels specific API Introduces two endpoints with operations specific to service levels: - switch_tenants: updates the scheduling group of all connections to be aligned with the service level specific to the logged in user. This is mostly legacy API, as with service levels on raft this is done automatically. - count_connections: for each user and for each scheduling group, counts how many connections are assigned to that user and scheduling group. This API is used in tests.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	a65c0c3735	api/cql_server_test: add information about scheduling group Now, information about connections' scheduling group is included in the HTTP API for querying information about connections' parameters.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	9319d65971	db/virtual_tables: add scheduling group column to system.clients Add the "scheduling_group" column to the system.clients table which names the scheduling group that currently serves the connection/client.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	bbc655ff32	test/boost: update service_level_controller_test for workload prio Adjust some of the existing tests in service_level_controller_test.cc and add some more in order to test the workload prioritization features, i.e. the service level shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ce4032dfc0	qos: include number of shares in DESCRIBE Now, the CREATE statements generated for each service level by the DESCRIBE SCHEMA WITH INTERNALS statement will account for the service level's shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	0f62eb45d1	cql3/statements: update SL statements for workload prioritization Introduce the "SHARES" keyword which can be used in conjunction with existing CQL statements related to the service levels. Adjust the CQL statements for service levels: - CREATE/ALTER now allow setting shares (only if the cluster is fully upgraded) - LIST EFFECTIVE SERVICE LEVEL now returns the number of shares in a new column - LIST SERVICE LEVEL(S) also returns the number of shares, and has the additional column "percentage of all service level shares"	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	6d90a933cd	transport/server: use scheduling group assigned to current user Now, when the user logs in and the connection becomes authenticated, the processing loop of the connection is switched to the scheduling group that corresponds to the service level assigned to the logged in user. The scheduling group is also updated when the service level assigned to this user changes. Starting from this commit, the scheduling groups managed by the service level controller are actually being used by user workload.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	f1b9737e07	messaging_service: use separate set of connections per service levels In order to make sure that the scheduling group carries over RPC, and also to prevent priority inversion issues between different service levels, modify the messaging service to use separate RPC connections for each service level in order to serve user traffic. The above is achieved by reusing the existing concept of "tenants" in messaging service: when a new service level (or, more accurately, service-level specific scheduling group) is first used in an RPC, a new tenant is created. In addition, extend the service level controller to be able to quickly look up the service level name of the currently active scheduling group in order to speed up the logic for choosing the tenant.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	7383013f43	replica/database: add reader concurrency semaphore groups Replace the reader concurrency semaphores for user reads and view updates with the newly introduced reader concurrency semaphore group, which assigns a semaphore for each service level. Each group is statically assigned to some pool of memory on startup and dynamically distribute this memory between the semaphores, relative to the number of shares of the corresponding scheduling group. The intent of having a separate reader concurrency semaphore for each scheduling group is to prevent priority inversion issues due to reads with different priorities waiting on the same semaphore, as well as make memory allocation more fair between service levels due to the adjusted number of shares.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	4cfd26efaf	qos: manage and assign scheduling groups to service levels Introduce the core logic of workload prioritization, responsible for assigning scheduling groups to service levels. The service level controller maintains a pool of scheduling groups for the currently present service levels, as well as a pool of unused scheduling groups which were previously used by some service level that was deleted during node's lifetime. When a new service level is created, the SL controller either assigns a scheduling group from the unused SG pool, or creates a new one if the pool is empty. The scheduling group is renamed to "sl:<scheduling group name>". When updating shares of a service level (and also when creating a new service level), the shares of the corresponding scheduling group are synchronized with those of the service level. When a service level is deleted, its group is released to the aforementioned pool of unused scheduling groups and the prefix of its name is changed from "sl:" to "sl_deleted:". For now, these scheduling groups are not used by any user operations. This will be changed in subsequent commits.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ff51551a94	qos: use the shares field in service level reads/writes Now, the newly introduced `shares` field is used when service levels are either read from or written into system tables.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	a6f681029f	qos: add shares to service_level_options Add service level shares related fields to service_level_options and slo_effective_names structs, and adjust the existing methods of the former (merge_with, init_effective_names) to account for them.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	2eb35f37d0	qos: explicitly specify columns when querying service level tables The service levels table is queried with a `SELECT * ...` query, by using the `execute_internal` method which prepares and caches the query in a special cache for internal queries, separate from the user query cache. During rolling upgrade from a version which does not support service level shares to the one that does, the `shares` column is added. The aforementioned internal query cache is _not_ invalidated on schema change, so the cache might still contain the prepared query from the time before the column was added, and that prepared query will fetch the old set of columns without the new `shares` column. In order to solve this, explicitly specify the columns in the query string, using the full set of column names from the time when the query is executed. Note that this is a problem only for the legacy, non-raft service levels. Raft-based service levels use a local table for which the schema is determined on startup. Also note that this code only fetches values from the `shares` column but does not make any use of it otherwise. It will be handled by later commits in this series.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ea25b29684	db/system_distributed_keyspace: add shares column and upgrade code Add the "shares" column to the system_distributed_keyspace.service_levels table, which is used by legacy code. Because this table is in a distributed and not local keyspace, adding the column to an existing cluster during rolling upgrade requires a bit of care. A callback is added to the workload prioritization cluster feature which runs when the feature becomes enabled and adds the column for all nodes in the cluster.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	346fc84c3e	db/system_keyspace: adjust SL schema for workload prioritization Add a "shares" column which holds the number of shares allocated to a given service level. It is not used by the code at all right now; subsequent commits will make good use of it.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	ecbf8721de	gms: introduce WORKLOAD_PRIORITIZATION cluster feature Information about the number of shares per service level will be stored in an additional column in the service levels table, which is managed through group0. We will need the feature to make sure that all nodes in the cluster know about the new column before any node starts applying group0 commands that would touch the new column. This feature also serves a role for the legacy service levels implementation that uses system_distributed for storage: after all nodes are upgraded to support workload prioritization, one of the nodes will perform a schema change operation and will add the new column.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	75d2d0d949	build: increase the max number of scheduling groups Workload prioritization assigns scheduling groups to service levels, and the number of scheduling groups that can exist at the same time is limited with a compile-time parameter in seastar. The documentation for workload prioritization says that we currently support 7 user-managed service levels and 1 created by default. Increase the current compile-time limit in order to align with the documentation.	2025-01-02 07:13:34 +01:00
Piotr Dulikowski	48e7ffc300	qos: return correct error code when SL does not exist The `nonexistant_service_level_exception` can be thrown by service levels code and propagated up to the CQL server layer, where it is converted into a CQL protocol error. The aforementioned exception inherits from `service_level_argument_exception`, which in turn inherits from `std::invalid_argument` - which doesn't mean much to the CQL layer and is converted to a generic SERVER_ERROR. We can do better and return a more meaningful error code for this exception. Change the base class of service_level_argument_exception to exceptions::invalid_request_exception which gets converted to an INVALID error. The INVALID error code was already being used by the enterprise version, so this commit just synchronizes error handling with enterprise.	2025-01-02 07:13:34 +01:00
Avi Kivity	727f68e0f5	Merge 'cql3: allow SELECT of specific collection element' from Michael Litvak This adds to the grammar the option to SELECT a specific element in a collection (map/set/list). For example: `SELECT map['key'] FROM table` `SELECT map['key1']['key2'] FROM table` This feature was implemented in Cassandra 4.0 and was requested by scylla users. The behavior is mostly compatible with Cassandra, except: 1. in SELECT, we allow list subscript in a selector, while cassandra allows only map and set. 2. in UPDATE, we allow set subscript in a column condition, while cassandra allows only map and list. 3. the slice syntax `SELECT m[a..b]` is not implemented yet 4. null subscript - `SELECT m[null]` returns null in scylla, while cassandra returns error Fixes #7751 backport was requested for a user to be able to use it Closes scylladb/scylladb#22051 * github.com:scylladb/scylladb: cql3: allow SELECT of specific collection key cql3: allow set subscript	2025-01-01 14:48:40 +02:00
Gleb Natapov	c4b26ba8dc	test: drop test_old_ip_notification_repro.py The test no longer tests anything since the address map is updated much earlier now by the gossiper itself, not by the notifiers. The functionality is tested by a unit test now.	2025-01-01 12:43:11 +02:00
Gleb Natapov	c4db90799a	test: address_map: check generation handling during entry addition Check that adding an entry with smaller generation does not overwrite existing entry.	2025-01-01 12:43:11 +02:00
Benny Halevy	85bd799308	storage_service: replicate_to_all_cores: prevent stalls when preparing per-table erms Although the `network_topology_strategy::make_replication_map` -> `tablet_aware_replication_strategy::do_make_replication_map` is not cpu intensive, it still allocates and constructs a shared `tablet_effective_replication_map`, and that might stall with thousands of tablet-based tables. Therefore coroutinize the preparation loop to allow yielding. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-31 14:52:39 +01:00
Gleb Natapov	745b6d7d0d	gossiper: ignore gossiper entries with local host id in gossiper mode as well We already ignore gossiper entries with host id equal to local host id in raft mode, since those entries are just outdated entries from before an ip change. The same logic applies to gossiper mode as well though, so do the same in both modes. Fixes: scylladb/scylladb#21930 Message-ID: <Z20kBZvpJ1fP9WyJ@scylladb.com>	2024-12-31 15:50:12 +02:00
Avi Kivity	76cf5148e1	Merge 'message: introduce advanced rpc compression' from Michał Chojnowski This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows: After the patch, messaging_service (Scylla's interface for all inter-node communication) compresses its network traffic with compressors managed by the new advanced_rpc_compression::tracker. Those compressors compress with lz4, but can also be configured to use zstd as long as a CPU usage limit isn't crossed. A precomputed compression dictionary can be fed to the tracker. Each connection handled by the tracker will then start a negotiation with the other end to switch to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary. All traffic going through the tracker is passed as a single merged "stream" through dict_sampler. dictionary_service has access to the dict_sampler. On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain a random multi-megabyte sample of the sampler's stream. Every several minutes, it copies the sample, trains a compression dictionary on it (by calling zstd's training library via the alien_worker thread) and publishes the new dictionary to system.dicts via Raft's write_mutation command. This update triggers (eventually) a callback on all nodes, which feeds the new dictionary to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections to this dictionary. 
Closes scylladb/scylladb#22032 * github.com:scylladb/scylladb: messaging_service: use advanced_rpc_compression::tracker for compression message/dictionary_service: introduce dictionary_service service: make Raft group 0 aware of system.dicts db/system_keyspace: add system.dicts utils: add advanced_rpc_compressor utils: add dict_trainer utils: introduce reservoir_sampling utils: introduce alien_worker utils: add stream_compressor	2024-12-31 15:02:57 +02:00
Evgeniy Naydanov	4260f3f55a	test.py: topology_random_failures: log randomization parameters in test Logging randomization parameters in the pytest_generate_tests hook doesn't play well for us. To make these parameters more visible move the logging to the test level. Closes scylladb/scylladb#22055	2024-12-31 14:23:47 +02:00
Avi Kivity	2b48c2e72a	Merge 'build: add support for LTO and PGO to the building system' from Kefu Chai This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git. Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO) to improve performance. LTO provides ~7% performance gain and enables crucial binary layout optimizations for PGO. LTO Changes: - Add `-flto` flag to compile and link steps - Use `-ffat-lto-objects` to generate both LLVM IR and machine code - Enable cross-object optimization while maintaining fast test linking PGO Implementation: - Implement three-stage build process: 1. Context-free profiling (`-fprofile-generate`) 2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`) 3. Final optimization using merged profiles - Add release-pgo and release-cs-pgo build stages - Integrate with ninja build system - Stages can be enabled independently Profile Management: - Add `pgo/pgo.py` for workload profile collection - Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS - Add configure.py integration for profile detection and validation - Support custom profiles via `--use-profile` flag - Add profile regeneration script Both optimizations are recommended for maximum performance, though each PGO stage adds a full build cycle. Future optimization may allow dropping one PGO stage if performance impact is minimal. --- this is a forward port, hence no need to backport. 
Closes scylladb/scylladb#22039 * github.com:scylladb/scylladb: build: cmake: add CMake options for PGO support build: cmake: add "Scylla_ENABLE_LTO" option build: set LTO and PGO flags for Seastar in cmake build build: collect scylla libraries with `scylla_libs` variable build: Unify Abseil CXX flags configuration configure.py: prepare the build for a default PGO profile in version control configure.py: introduce profile-guided optimization pgo: add alternator workloads training pgo: add a repair workload pgo: add a counters workload pgo: add a secondary index workload pgo: add a LWT workload pgo: add a decommission workload pgo: add a clustering workload pgo: add a basic workload pgo: introduce a PGO training script configure.py: don't include non-default modes in dist-server-* rules configure.py: enable LTO in release builds by default configure.py: introduce link-time optimization configure.py: add a `default` to `add_tristate`. configure.py: unify build rules for cxxbridge .cc files and regular .cc files	2024-12-31 14:14:40 +02:00
Avi Kivity	4905b1bf76	Merge 'table: make update_effective_replication_map sync again' from Benny Halevy Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`). Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, which caused scylladb/scylladb#17786. This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a member and storage_group_manager::stop() consumes this future during table::stop(). - storage_service: replicate_to_all_cores: update base and view tables atomically Currently, the loop updating all tables (including views) with the new effective_replication_map may yield, and therefore expose a state where the base and view tables' effective_replication_map and topology are out of sync (as seen in scylladb/scylladb#17786). To prevent that, loop over all base tables and for each table update the base table and all views atomically, without yielding, and so allow yielding only between base tables. * Regression was introduced in `f2ff701489`, so backport is required to 6.x, 2024.2 Closes scylladb/scylladb#21781 * github.com:scylladb/scylladb: storage_service: replicate_to_all_cores: clear_gently pending erms test_mv_topology_change: drop delay_after_erm_update injection case storage_service: replicate_to_all_cores: update base and view tables atomically table: make update_effective_replication_map sync again	2024-12-30 23:42:06 +02:00
Tomasz Grabiec	bf3d0b3543	reader_concurrency_semaphore: Optimize resource_units destruction by postponing wait list processing Observed 3% throughput improvement in sstable-heavy workload bounded by CPU. SStable parsing involves lots of buffer operations which obtain and destroy resource_units. Before the patch, resource_units destruction invoked maybe_admit_waiters(), which performs some computations on waiting permits. We don't really need to admit on each change of resources, since the CPU is used by other things anyway. We can batch the computation. There is already a fiber which does this for processing the _ready_list. We can reuse it for processing _wait_list as well. The changes violate an assumption made by tests that releasing resources immediately triggers an admission check. Therefore, some of the BOOST_REQUIRE_EQUAL checks need to be replaced with REQUIRE_EVENTUALLY_EQUAL as the admission check is now done in the fiber processing the _ready_list. `perf-simple-query` --tablets --smp 1 -m 1G results obtained for fixed 400MHz frequency: Before: ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
112590.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41353 insns/op, 17992 cycles/op, 0 errors) 122620.68 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41310 insns/op, 17713 cycles/op, 0 errors) 118169.48 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41353 insns/op, 17857 cycles/op, 0 errors) 120634.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41328 insns/op, 17733 cycles/op, 0 errors) 117317.18 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41347 insns/op, 17822 cycles/op, 0 errors) throughput: mean=118266.52 standard-deviation=3797.81 median=118169.48 median-absolute-deviation=2368.13 maximum=122620.68 minimum=112590.60 instructions_per_op: mean=41337.86 standard-deviation=18.73 median=41346.89 median-absolute-deviation=14.64 maximum=41352.53 minimum=41309.83 cpu_cycles_per_op: mean=17823.50 standard-deviation=111.75 median=17821.97 median-absolute-deviation=90.45 maximum=17992.04 minimum=17713.00 ``` After ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
123689.63 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40997 insns/op, 17384 cycles/op, 0 errors) 129643.24 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40997 insns/op, 17325 cycles/op, 0 errors) 128907.27 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41009 insns/op, 17325 cycles/op, 0 errors) 130342.56 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40993 insns/op, 17286 cycles/op, 0 errors) 130294.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40972 insns/op, 17336 cycles/op, 0 errors) throughput: mean=128575.36 standard-deviation=2792.75 median=129643.24 median-absolute-deviation=1718.73 maximum=130342.56 minimum=123689.63 instructions_per_op: mean=40993.51 standard-deviation=13.23 median=40996.73 median-absolute-deviation=3.30 maximum=41008.86 minimum=40972.48 cpu_cycles_per_op: mean=17331.16 standard-deviation=35.02 median=17324.84 median-absolute-deviation=6.49 maximum=17383.97 minimum=17286.33 ``` Closes scylladb/scylladb#21918 [avi: patch was co-authored by Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>]	2024-12-30 23:37:46 +02:00
Michael Litvak	5ef7afb968	cql3: allow SELECT of specific collection key This adds to the grammar the option to SELECT a specific key in a collection column using subscript syntax. For example: SELECT map['key'] FROM table SELECT map['key1']['key2'] FROM table The key can also be parameterized in a prepared query. For this we need to pass the query options to result_set_builder where we process the selectors. Fixes scylladb/scylladb#7751	2024-12-30 17:05:20 +02:00
Wojciech Mitros	74cbc77f50	test: add test for schema registry maintaining base info for views In this patch we test the behavior of schema registry in a few scenarios where it was identified it could misbehave. The first one is reverse schemas for views. Previously, SELECT queries with reverse order on views could fail because we didn't have base info in the registry for such schemas. The second one is schemas that temporarily died in the registry. This can happen when, while processing a query for a given schema version, all related schema_ptrs were destroyed, but this schema was requested before schema_registry::grace_period() has passed. In this scenario, the base info would not be recovered, causing errors.	2024-12-30 14:59:06 +01:00
Wojciech Mitros	3094ff7cbe	schema_registry: avoid setting base info when getting the schema from registry After the previous patches, the view schemas returned by schema registry always have their base info set. As such, we no longer need to set it after getting the view schema from the registry. This patch removes these unnecessary updates.	2024-12-30 14:56:18 +01:00
Wojciech Mitros	82f2e1b44c	schema_registry: update cached base schemas when updating a view The schema registry now holds base schemas for view schemas. The base schema may change without changing the view schema, so to preserve the change in the schema registry, we also update the base schema in the registry when updating the base info in the view schema.	2024-12-30 14:56:18 +01:00
Wojciech Mitros	dfe3810f64	schema_registry: cache base schemas for views Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated. To fix this, this patch adds the base schema to the registry, alongside the view schema. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. To store the base schema, the loader methods now have to return the base schema alongside the view schema. At the same time, when loading into the registry, we need to check whether we're loading a view schema, and if so, we need to also provide the base schema. When inserting a regular table schema, the base schema should be a disengaged optional.	2024-12-30 14:56:17 +01:00
Wojciech Mitros	6f11edbf3f	db: set base info before adding schema to registry In the following patches, we'll ensure that view schemas returned by the schema registry always have base info set. To prepare for that, make sure that the base info is always set before inserting a view schema into the schema registry.	2024-12-30 14:56:17 +01:00
Avi Kivity	b32b7ab806	Merge 'test.py: only access combined_tests executable if it is built' from Konstantin Osipov test.py: only access combined_tests executable if it is built Fixes #22038 Closes scylladb/scylladb#22069 * github.com:scylladb/scylladb: test.py: only access combined_tests if it exists test.py: rethrow CancelledError when executing a test	2024-12-30 15:15:39 +02:00
Piotr Smaron	2352063f20	server: set `connection_stage` to READY when authenticated If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional, and in practice only happens on only one of a driver's connections — because there's no point listening for the same events on multiple connections), connections are wrongly displayed in the system.clients as AUTHENTICATING instead of READY, even when they are ready. This commit fixes this problem. Fixes: scylladb/scylladb#12640 Closes scylladb/scylladb#21774	2024-12-30 14:04:26 +02:00
Kefu Chai	6281fb825f	test/pytest.ini: ignore warning on deprecated record_property fixture `record_property` generates XML which is not compatible with xunit2, so pytest decided to deprecate it when generating xunit reports. and pytest generates the following warning when a test failure is reported using this fixture: ``` object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1') ``` this warning is not related to the test, but more about how we report a failure using pytest. it is distracting, so let's silence it. See also https://github.com/pytest-dev/pytest/issues/5202 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22067	2024-12-30 10:58:31 +02:00
Nadav Har'El	27180620af	Merge 'topology_random_failures: deselect more cases which can cause #21534 ' from Evgeniy Naydanov There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which are caused by the `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events. We need to deselect them for now to make CI more stable. The first batch was deselected in https://github.com/scylladb/scylladb/pull/21658 Also, add the handling of topology state rollback caused by `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit. See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957 Closes scylladb/scylladb#22044 * github.com:scylladb/scylladb: test.py: topology_random_failures: more deselects for #21534 test.py: topology_random_failures: handle more node's hangs during 30s sleep	2024-12-30 10:52:22 +02:00
Michael Litvak	2701b5d50d	cql3: allow set subscript This allows using a subscript on a set column, in addition to map/list, which was possible until now. The behavior is compatible with Cassandra - a subscript with a specific value returns the value if it's found in the set, and null otherwise.	2024-12-30 09:50:31 +02:00
Konstantin Osipov	8b7a5ca88d	test.py: only access combined_tests if it exists When the scylla source tree is only partially built, we still may want to run the tests. test.py builds a case cache at boot, and executes --list-cases for that, for all built tests. After amalgamating boost unit tests into a single file, it started running it unconditionally, which broke partial builds. Hence, only use combined_tests executable if it exists. Fixes #22038	2024-12-27 14:54:13 -05:00
Konstantin Osipov	2b1ba9c3fd	test.py: rethrow CancelledError when executing a test Commit `870f3b00fc`, "Add option to fail after number of failures" adds tracking on the number of cancelled tests. For the purpose, it intercepts CancelledError and sets test's is_cancelled flag. This introduced a regression reported in gh-21636: Ctrl-C no longer works, since CancelledError is muted. There was no intent to mute the exception, re-throw it after accounting the test as cancelled.	2024-12-27 14:40:47 -05:00
Michał Chojnowski	fdb2d2209c	messaging_service: use advanced_rpc_compression::tracker for compression This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`, `dict_sampler` and `dictionary_service` in `main()`, and wires them to each other and to `messaging_service`. `messaging_service` compresses its network traffic with compressors managed by the `advanced_rpc_compression::tracker`. All this traffic is passed as a single merged "stream" through `dict_sampler`. `dictionary_service` has access to `dict_sampler`. On chosen nodes (by default: the Raft leader), it uses the sampler to maintain a random multi-megabyte sample of the sampler's stream. Every several minutes, it copies the sample, trains a compression dictionary on it (by calling zstd's training library via the `alien_worker` thread) and publishes the new dictionary to `system.dicts` via Raft. This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes, which updates the dictionary used by the compressors it manages.	2024-12-27 10:17:58 +01:00
Kefu Chai	cf35562e89	test/pylib: use `foo` instead of `'{}'.format(foo)` for better readability Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 17:09:56 +08:00
Kefu Chai	71eccf01c7	test/pylib: use "foo not in bar" instead of "not foo in bar" for better readability Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 17:09:56 +08:00
Kefu Chai	6adf70ec03	build: cmake: add CMake options for PGO support - "Scylla_BUILD_INSTRUMENTED" option Scylla_BUILD_INSTRUMENTED allows us to instrument the code at different levels, namely, IR, and CSIR. this option mirrors "--pgo" and "--cspgo" options in `configure.py` . please note, the instrumentation at the frontend is not supported, as the IR based instrumentation is better when it comes to the use case of optimization for performance. see https://lists.llvm.org/pipermail/llvm-dev/2015-August/089044.html for the rationales. - "Scylla_PROFDATA_FILE" option this option allows us to specify the profile data previously generated with the "Scylla_BUILD_INSTRUMENTED" option. this option mirrors the `--use-profile` option in `configure.py`, but it does not take the empty option as a special case and consider it as a file fetched from Git LFS. that will be handled by another option in a follow-up change. please note, one cannot use -DScylla_BUILD_INSTRUMENTED=PGO and -DScylla_PROFDATA_FILE=... at the same time. clang just does not allow this. but CSPGO is fine. - "Scylla_PROFDATA_COMPRESSED_FILE" option this option allows us to specify the compressed profile data previously generated with the "Scylla_BUILD_INSTRUMENTED" option. along with "Scylla_PROFDATA_FILE", this option mirrors the functionality of `--use-profile` in `configure.py`. the goal is to ensure user always gets the result with the specified options. if anything goes wrong, we just error out. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 16:16:04 +08:00
Kefu Chai	4154789670	build: cmake: add "Scylla_ENABLE_LTO" option add an option named "Scylla_ENABLE_LTO", which is off by default. if it is on, build the whole tree with ThinLTO enabled. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 16:16:04 +08:00
Kefu Chai	2647369d46	build: set LTO and PGO flags for Seastar in cmake build This change extends scylla commit `7cb74df` to scylla-enterprise-commit 4ece7e1. we recently started building Seastar as an external project, so we need to prepare its compilation flags separately. in enterprise scylla, we prepare the LTO and PGO related cflags in `prepare_advanced_optimizations()`. this function is called when preparing the build rules directly from `configure.py`, and although we have equivalent settings in CMake, they cannot be applied to Seastar due to the reason above. in this change, we set up the LTO and PGO compilation flags when generating the build system for Seastar when building using CMake. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 16:16:04 +08:00
Kefu Chai	ffe8c5dcdb	build: collect scylla libraries with `scylla_libs` variable with which, we can set the properties of these targets in a single place. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 16:16:04 +08:00
Kefu Chai	610f1b7a0a	build: Unify Abseil CXX flags configuration - Set ABSL_GCC_FLAGS and ABSL_LLVM_FLAGS with a more generic absl_cxx_flags - Enables more flexible configuration of compiler flags for Abseil libraries - Provides a centralized approach to setting compilation flags Previously, sanitizer-specific flags were directly applied to Abseil library builds. This change allows for more extensible compiling flag management across different build configurations. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-12-27 16:16:04 +08:00
Michał Chojnowski	131b1d6f81	configure.py: prepare the build for a default PGO profile in version control This patch adds the following logic to the release build: pgo/profiles/profile.profdata.xz is the default profile file, compressed. This file is stored in version control using git LFS. A ninja rule is added which creates build/profile.profdata by decompressing it. If no profile file is explicitly specified, ./configure.py checks whether the compressed default profile file exists and is compressed. (If it exists, but isn't compressed, the user most likely has git lfs disabled or not installed. In this case, the file visible in the working tree will be the LFS placeholder text file describing the LFS metadata.) If the compressed file exists, build/profile.profdata is chosen as the used profile file. If it doesn't exist, a warning is printed and configure.py falls back to a profileless build. The default profile file can be explicitly disabled by passing the empty --use-profile="" to configure.py A script is added which re-generates the profile. After the script is run, the re-generated compressed profile can be staged, committed, pushed and merged to update the default profile.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	a868b44ad8	configure.py: introduce profile-guided optimization This commit enables profile-guided optimizations (PGO) in the Scylla build. A full LLVM PGO requires 3 builds: 1. With -fprofile-generate to generate context-free (pre-inlining) profile. This profile influences inlining, indirect-call promotion and call graph simplifications. 2. With -fprofile-use=results_of_build_1 -fcs-profile-generate to generate context-sensitive (post-inlining) profile. This profile influences post-inline and codegen optimizations. 3. With -fprofile-use=merged_results_of_builds_1_2 to build the final binary with both profiles. We do all three in one ninja call by adding release-pgo and release-cs-pgo "stages" to release. They are a copy of regular release mode, just with the flags described above added. With the full course, release objects depend on the profile file produced by build/release-cs-pgo/scylla, while release-cs-pgo depends on the profile file generated by build/release-pgo/scylla. The stages are orthogonal and enabled with separate options. It's recommended to run them both for full performance, but unfortunately each one adds a full build of scylla to the compile time, so maybe we can drop one of them in the future if it turns out e.g. that regular PGO doesn't have a big effect. It's strongly recommended to combine PGO with LTO. The latter enables the entire class of binary layout optimizations, which for us is probably the most important part of the entire thing.	2024-12-27 16:16:04 +08:00
Marcin Maliszkiewicz	80989556ac	pgo: add alternator workloads training This patch adds a set of alternator workloads to pgo training script. To confirm that added workloads are indeed affecting profile we can compare: ⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/clustering/prof.profdata Instrumentation level: IR entry_first = 0 Total functions: 105075 Maximum function count: 1079870885 Maximum internal block count: 2197851358 and ⤖ llvm-profdata show ./build/release-pgo/profiles/workdirs/alternator/prof.profdata Instrumentation level: IR entry_first = 0 Total functions: 105075 Maximum function count: 5240506052 Maximum internal block count: 9112894084 to see that function counters are on similar levels, they are around 5x higher for alternator but that's because it combines 5 specific sub-workloads. To confirm that final profile contains alternator functions we can inspect: ⤖ llvm-profdata show --counts --function=alternator --value-cutoff 100000 ./build/release-pgo/profiles/merged.profdata (...) Instrumentation level: IR entry_first = 0 Functions shown: 356 Total functions: 105075 Number of functions with maximum count (< 100000): 97275 Number of functions with maximum count (>= 100000): 7800 Maximum function count: 7248370728 Maximum internal block count: 13722347326 we can see that 356 functions whose symbol name contains the word alternator were identified as 'hot' (with max count greater than 100'000). Running: ⤖ llvm-profdata show --counts --function=alternator --value-cutoff 1 ./build/release-pgo/profiles/merged.profdata (...) Instrumentation level: IR entry_first = 0 Functions shown: 806 Total functions: 105075 Number of functions with maximum count (< 1): 67036 Number of functions with maximum count (>= 1): 38039 Maximum function count: 7248370728 Maximum internal block count: 13722347326 we can see that 806 alternator functions were executed at least once during training. 
And finally to confirm that alternator specific PGO brings any speedups we run: for workload in read scan write write_gsi write_rmw do ./build/release/scylla perf-alternator-workloads --smp 4 --cpuset "10,12,14,16" --workload $workload --duration 1 --remote-host 127.0.0.1 2> /dev/null \| grep median done results BEFORE: median 258137.51910849303 median absolute deviation: 786.06 median 547.2578202937141 median absolute deviation: 6.33 median 145718.19856685458 median absolute deviation: 5689.79 median 89024.67095807113 median absolute deviation: 1302.56 median 43708.101729598646 median absolute deviation: 294.47 results AFTER: median 303968.55333940056 median absolute deviation: 1152.19 median 622.4757636209254 median absolute deviation: 8.42 median 198566.0403745328 median absolute deviation: 1689.96 median 91696.44912842038 median absolute deviation: 1891.84 median 51445.356525664996 median absolute deviation: 1780.15 We can see that single node cluster tps increase is typically 13% - 17% with notable exceptions, improvement for write_gsi is 3% and for write workload whopping 36%. The increase is on top of CQL PGO. Write workload is executed more often because it's involved also as data preparation for read and scan. Some further improvement could be to separate preparation from training as it's done for CQL but it would be a bit odd if ~3x higher counters for one flow have so big impact. Additional disclaimers: - tests are performing exactly the same workloads as in training so there might be some bias - tests are running single node cluster, more realistic setup will likely show lower improvement Fixes https://github.com/scylladb/scylla-enterprise/issues/4066	2024-12-27 16:16:04 +08:00
Michał Chojnowski	95c8d88b96	pgo: add a repair workload This workload is added to teach PGO about repair. Tests are inconclusive about its alignment with existing workloads, because repair doesn't seem to utilize 100% of the reactor.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	1c9ce0a9ee	pgo: add a counters workload This workload is added to teach PGO about counters. Tests seem to show it's mostly aligned with existing CQL workloads. The config YAML is based on the default cassandra-stress schema.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	47dc0399cb	pgo: add a secondary index workload This workload is added to teach PGO about secondary indexes. Tests seem to show that it's mostly aligned with existing CQL workloads. The config YAML was copied from one of scylla-cluster-tests test cases.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	e67f4a5c51	pgo: add a LWT workload This workload is added to teach PGO about LWT codepaths. Tests seem to show that it's mostly aligned with existing CQL workloads. The config YAML was copied from one of scylla-cluster-tests test cases.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	e217c124a6	pgo: add a decommission workload This workload is added to teach PGO about streaming. Tests show that this workload is mostly orthogonal to CQL workloads (where "orthogonal" means that training on workload A doesn't improve workload B much, while training on workload B doesn't improve workload A much), so adding it to the training is quite important.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	65abecaede	pgo: add a clustering workload In contrast to the basic workload, this workload uses clustering keys, CK range queries, RF=1, logged batches, and more CQL types. Tests seem to show that this workload is mostly aligned with the existing basic workload (where "aligned" means that training on workload A improves workload B about as much as training on workload B). The config YAML is based on the example YAML attached to cassandra-stress sources.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	c1297dbcd2	pgo: add a basic workload This commit adds the default cassandra-stress workload to the PGO training suite.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	f73b122de3	pgo: introduce a PGO training script Profile-guided optimization consists of the following steps: 1. Build the program as usual, but with special options (instrumentation or just some supplementary info tables, depending on the exact flavor of PGO in use). 2. Collect an execution profile from the special binary by running a training workload on it. 3. Rebuild the program again, using the collected profile. This commit introduces a script automating step 2: running PGO training workloads on Scylla. The contents of training workloads will be added in future commits. The changes in configure.py responsible for steps 1. and 3. will also appear in future commits. As input, the script takes a path to the instrumented binary, a path to the output file, and a directory with (optionally) prepopulated datasets for use in training. The output profile file can be then passed to the compiler to perform a PGO build. The script currently supports two kinds of PGO instrumentation: LLVM instrumentation (binary instrumented with -fprofile-generate and -fcs-profile-generate passed to clang during compilation) and BOLT instrumentation (binary instrumented with `llvm-bolt -instrument`, with logs from this operation saved to $binary_path.boltlog). The actual training workloads for generating the profile will be added in later commits.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	6f01ceae3d	configure.py: don't include non-default modes in dist-server-* rules dist-server-tar only includes default modes. Let dist-server-deb and dist-server-rpm behave consistently with it.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	dd1a847d61	configure.py: enable LTO in release builds by default	2024-12-27 16:16:04 +08:00
Michał Chojnowski	4b03b91fbd	configure.py: introduce link-time optimization This patch introduces link-time optimization (LTO) to the build. The performance gains from LTO alone are modest (~7%), but it's a vital ingredient of effective profile-guided optimization, which will be introduced later. In general, use of LTO is quite simple and transparent to build systems. It is sufficient to add the -flto flag to compile and link steps, and use an LTO-aware linker. At compile time, -ffat-lto-objects will cause the compiler to emit .o files containing both LTO-ready LLVM IR for main executable optimization and machine code for fast test linking. At link time, those pieces of IR will be compiled together, allowing cross-object optimization of the main executable and the fast linking of test executables. Due to its high compile time cost, the optimization can be toggled with a configure.py option. As of this patch, it's disabled by default.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	192cb6de4b	configure.py: add a `default` to `add_tristate`. It will be used in the next patch.	2024-12-27 16:16:04 +08:00
Michał Chojnowski	1224200d7a	configure.py: unify build rules for cxxbridge .cc files and regular .cc files This is going to prevent some code duplication in following patches.	2024-12-27 16:16:04 +08:00
Benny Halevy	3e22998dc1	sstables: parse(summary): reserve positions vector We know the number of positions in advance so reserve the chunked_vector capacity for that. Note: reservation replaces the existing reset of the positions member. This is safe since we parse the summary only once as sstable::read_summary() returns early if the summary component is already populated. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21767	2024-12-26 13:33:29 +02:00
Yaron Kaikov	bc487c9456	.github: cherry-pick each commit instead of merge commit when available Until today, when we had a PR with multiple commits we cherry-picked the merge commit only, which created a PR with only one commit (the merge commit) with all relevant changes. This was causing an issue when there was a need to backport part of the commits like in https://github.com/scylladb/scylladb/pull/21990 (reported by @gleb-cloudius). Changing the logic to cherry-pick each commit. Closes scylladb/scylladb#22027	2024-12-26 13:10:18 +02:00
Kefu Chai	6acc5294a4	treewide: migrate from boost::copy_range to std::ranges::to now that we are allowed to use C++23. we now have the luxury of using `std::ranges::to`. in this change, we: - replace `boost::copy_range` with `std::ranges::to` - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21880	2024-12-26 11:46:26 +02:00
Kefu Chai	6c031ad92f	test/topology: Percent-encode URL in pytest artifact links When embedding HTML documents in pytest reports with links to test artifacts, parameterized test names containing special characters like "[" and "]" can cause URL encoding issues. These characters, when used verbatim in URLs, can trigger HTTP 400 errors on web servers. This commit resolves the issue by percent-encoding the URLs for artifact links, ensuring compatibility with servers like Jenkins and preventing "HTTP ERROR 400 Illegal Path Character" errors. Changes: - Percent-encode test artifact URLs to handle special characters - Improve link robustness for parameterized test names Fixes scylladb/scylla-pkg#4599 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21963	2024-12-26 10:23:52 +02:00
Benny Halevy	a25c3eaa1c	utils: phased_barrier: add close() method When services are stopped we generally want to call advance_and_await(), but we should also prevent starting new operations, so close() would do that by closing the phased_barrier active gate (which implicitly also awaits past operations similar to advance_and_await()). Add unit tests for that and use in existing services. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-26 06:54:07 +02:00
Benny Halevy	311c52fbb1	utils: phased_barrier: advance_and_await: allocate new gate only when needed If there are no operations in progress, there is no need to close the current gate and allocate a new one. The current gate can be reused for the new phase just as well. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-26 06:53:43 +02:00
Konstantin Osipov	d87e1eb7ef	test: merge topology_experimental_raft into topology_custom This enables tablets in topology_custom, so explicitly disable them where tests don't support tablets. In scope of this rename, patch a few imports. Importing dependencies from another test is a bad idea - please use shared libraries instead. Fixed #20193 Closes scylladb/scylladb#22014	2024-12-26 00:33:08 +02:00
Yaron Kaikov	0fc7e786dd	.github/scripts/auto-backport.py: fix wrong username param In `2e6755ecca` I have added a comment when PR has conflicts so the assignee can get a notification about it. There was a problem with the user mention param (a missing `.login`) Fixing it Closes scylladb/scylladb#22036	2024-12-25 20:41:34 +02:00
Avi Kivity	465449e4a1	test: combined_test: relicense Was inadvertently released under the AGPL.	2024-12-25 13:53:54 +02:00
Avi Kivity	3ffe93b6ae	Merge 'Enhance load-and-stream with "scope"' from Pavel Emelyanov The main purpose of this change is to enhance the restore from object storage usage. Currently, restore uses the load-and-stream facility. When triggered, the restoring task opens the provided list of sstables directory from the remote bucket and then feeds the list of sstables to load_and_stream() method. The method, in turn, iterates over this list, reads mutations and for each mutation decides where to send one by checking the replication map (it's pretty much the same for both vnodes and tablets, but for tablets that are "fully contained" by a range there's the plan to stream faster). As described above, restore is governed by a single node and this single node reads all sstables from the object store, which can be very slow. This PR allows speeding things up. For that, the load-and-stream code is equipped with the "scope" filter which limits where mutations can be streamed to. There are four options for that -- all, dc, rack and node. The "all" is how things work currently, "dc" and "rack" filter out target nodes that don't belong to this node's dc/rack respectively. The "node" scope only streams mutations to local node. With the "node" scope it's possible to make all nodes in the cluster load mutations that belong to them in parallel, without re-sending them to peers. The last patch in this PR is the test that shows how it can be possible. Closes scylladb/scylladb#21169 * github.com:scylladb/scylladb: test: Add scope-streaming test (for restore from backup) api: New "scope" API param to load-and-stream calls sstables_loader: Propagate scope from API down sstables_loader: Filter tablets based on scope streamer: Disable scoped streaming of primary replica only sstables_loader: Introduce streaming scope sstables_loader: Wrap get_endpoints()	2024-12-25 13:52:51 +02:00
Nadav Har'El	23213e8696	Merge 'Make get_built_indexes REST API endpoint be consistent with system."IndexInfo" table' from Pavel Emelyanov It turned out that aforementioned APIs use slightly different sources of information about view build progress/status which sometimes results in different reporting of whether an index is built. It's good to make those two APIs consistent. Also add a test for the REST API endpoint (system table test was addressed by #21677). Closes scylladb/scylladb#21814 * github.com:scylladb/scylladb: test: Add tests for MVs and indexes reporting by API endpoint(s) api: Use built_views table in get_built_indexes API	2024-12-25 11:47:03 +02:00
Evgeniy Naydanov	5992e8b031	test.py: topology_random_failures: more deselects for #21534 More cases found which can cause the same 'local_is_initialized()' assertion during the node's bootstrap.	2024-12-25 06:38:13 +00:00
Evgeniy Naydanov	f337ecbafa	test.py: topology_random_failures: handle more node's hangs during 30s sleep The node is hanging and the coordinator just rolls back a topology state. It's different from `stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration` because in these cases the coordinator is just not able to start the topology change at all and a message in the coordinator's log is different. Error injections handled: - `stop_after_updating_cdc_generation` - `stop_before_streaming` And, actually, it can be any cluster event which lasts more than 30s.	2024-12-25 06:38:13 +00:00
Avi Kivity	f9c3ab03a3	Merge 'Sort by proximity: shuffle equal-distance replicas' from Benny Halevy This series re-implements locator::topology::sort_by_proximity and adds some randomization to shuffle equal-distance replicas for improving load-balancing when reading with 1 < consistency level < replication factor. This change also adds a manual test for benchmarking sort_by_proximity, as it's not exercised by the single-node perf-simple-query. The benchmark shows performance improvement of over 20% (from about 71 ns to 56 ns per call for 3-node vectors), mainly due to "calculate distance only once" which pre-calculates the distance from the reference node for each replica once, rather than each time the comparator is called by std::sort * Improvement. No backport needed Closes scylladb/scylladb#21958 * github.com:scylladb/scylladb: locator/topology: do_sort_by_proximity: shuffle equal-distance replicas locator/topology: sort_by_proximity: calculate distance only once utils: small_vector: expose internal_capacity() storage_proxy: sort_endpoints_by_proximity: lookup my_id only if cannot sort by proximity test/perf: add perf_sort_by_proximity benchmark locator: refactor sort_by_proximity	2024-12-24 17:37:48 +02:00
Pavel Emelyanov	644d36996d	test: Add tests for MVs and indexes reporting by API endpoint(s) So far there's the /column_family/built_indexes one that reports the index names similar to how system.IndexInfo does, but it's not tested. This patch adds tests next to existing system. table ones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-24 16:18:32 +03:00
Pavel Emelyanov	5eb3278d9e	api: Use built_views table in get_built_indexes API Somehow system."IndexInfo" table and column_family/built_indexes REST API endpoint declare an index "built" at slightly different times: The former is a virtual table which declares an index completely built when it appears on the system.built_views table. The latter uses different data -- it takes the list of indexes in the schema and eliminates indexes which are still listed in the system.scylla_views_builds_in_progress table. The mentioned system. tables are updated at different times, so API notices the change a bit later. It's worth improving the consistency of these two APIs by making the REST API endpoint piggy-back the load_built_views() instead of load_view_build_progress(). With that change the filtering of indexes should be negated. Fixes #21587 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-24 16:18:00 +03:00
Benny Halevy	d1490bb7bf	locator/topology: do_sort_by_proximity: shuffle equal-distance replicas To improve balancing when reading in 1 < CL < ALL This implementation has a moderate impact on the function performance in contrast to full std::shuffle of the vector before stable_sort:ing it (especially with large number of nodes to sort). Before: test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6 After: sort_by_proximity_topology.perf_sort_by_proximity 19689561 50.195ns 0.119ns 50.076ns 51.145ns 0.000 0.000 622.5 150.6 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 13:00:17 +02:00
Benny Halevy	0fe8bdd0db	locator/topology: sort_by_proximity: calculate distance only once And use a temporary vector to use the precalculated distances. A later patch will add some randomization to shuffle nodes at the same distance from the reference node. This improves the function performance by 50% for 3 replicas, from 77.4 ns to 39.2 ns, larger replica sets show greater improvement (over 4X for 15 nodes): Before: test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6 After: sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 12:27:03 +02:00
Benny Halevy	4af522f61e	utils: small_vector: expose internal_capacity() So we can use it for defining other small_vector deriving their internal capacity from another small_vector type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 12:19:20 +02:00
Benny Halevy	3a3df43799	storage_proxy: sort_endpoints_by_proximity: lookup my_id only if cannot sort by proximity topology::sort_by_proximity already sorts the local node address first, if present, so look it up only when using SimpleSnitch, where sort_by_proximity() is a no-op. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 12:19:20 +02:00
Benny Halevy	75da99ce8b	test/perf: add perf_sort_by_proximity benchmark benchmark sort_by_proximity Baseline results on my desktop for sorting 3 nodes: single run iterations: 0 single run duration: 1.000s number of runs: 5 number of cores: 1 random seed: 20241224 test iterations median mad min max allocs tasks inst cycles sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-24 12:18:24 +02:00
Michał Chojnowski	5ce1e4410f	message/dictionary_service: introduce dictionary_service This "service" is a bag for code responsible for dictionary training, created to unclutter main() from dictionary-specific logic. It starts the RPC dictionary training loop when the relevant cluster feature is enabled, pauses and unpauses it appropriately whenever relevant config or leadership status are updated, and publishes new dictionaries whenever the training fiber produces them.	2024-12-23 23:37:02 +01:00
Michał Chojnowski	6a982ee0dc	service: make Raft group 0 aware of system.dicts Adds glue which causes the contents of system.dicts to be sent in group 0 snapshots, and causes a callback to be called when system.dicts is updated locally. The callback is currently empty and will be hooked up to the RPC compressor tracker in one of the next commits.	2024-12-23 23:37:02 +01:00
Michał Chojnowski	cc15ca329e	db/system_keyspace: add system.dicts Adds a new system table which will act as the medium for distributing compression dictionaries over the cluster. This table will be managed by Raft (group 0). It will be hooked up to it in follow-up commits.	2024-12-23 23:37:02 +01:00
Michał Chojnowski	0fd1050784	utils: add advanced_rpc_compressor Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries as the network traffic compressors for Seastar's RPC servers. The main jobs of this glue are: 1. Implementing the API expected by Seastar from RPC compressors. 2. Expose metrics about the effectiveness of the compression. 3. Allow dynamically switching algorithms and dictionaries on a running connection, without any extra waits. The biggest design decision here is that the choice of algorithm and dictionary is negotiated by both sides of the connection, not dictated unilaterally by the sender. The negotiation algorithm is fairly complicated (a TLA+ model validating it is included in the commit). Unilateral compression choice would be much simpler. However, negotiation avoids re-sending the same dictionary over every connection in the cluster after dictionary updates (with one-way communication, it's the only reliable way to ensure that our receiver possesses the dictionary we are about to start using), lets receivers ask for a cheaper compression mode if they want, and lets them refuse to update a dictionary if they don't think they have enough free memory for that. In hindsight, those properties probably weren't worth the extra complexity and extra development effort. Zstd can be quite expensive, so this patch also includes a mechanism which temporarily downgrades the compressor from zstd to lz4 if zstd has been using too much CPU in a given slice of time. But it should be noted that this can't be treated as a reliable "protection" from negative performance effects of zstd, since a downgrade can happen on the sender side, and receivers are at the mercy of senders.	2024-12-23 23:37:02 +01:00
Michał Chojnowski	5294762ac7	utils: add dict_trainer	2024-12-23 23:37:02 +01:00
Michał Chojnowski	9de52b1c98	utils: introduce reservoir_sampling We are planning to improve some usages of compression in Scylla (in which we compress small blocks of data) by pre-training compression dictionaries on similar data seen so far. For example, many RPC messages have similar structure (and likely similar data), so the similarity could be exploited for better compression. This can be achieved e.g. by training a dictionary on the RPC traffic, and compressing subsequent RPC messages against that dictionary. To work well, the training should be fed a representative sample of the compressible data. Such a sample can be approached by taking a random subset (of some given reasonable size) of the data, with uniform probability. For our purposes, we need an online algorithm for this -- one which can select the random k-subset from a stream of arbitrary size (e.g. all RPC traffic over an hour), while requiring only the necessary minimum of memory. This is a known problem, called "reservoir sampling". This PR introduces `reservoir_sampler`, which implements an optimal algorithm for reservoir sampling. Additionally, it introduces `page_sampler` -- a wrapper for `reservoir_sampler`, which uses it to select a random sample of pages from a stream of bytes.	2024-12-23 23:37:02 +01:00
Michał Chojnowski	d301c29af5	utils: introduce alien_worker Introduces a util which launches a new OS thread and accepts callables for concurrent execution. Meant to be created once at startup and used until shutdown, for running nonpreemptible, 3rd party, non-interactive code. Note: this new utility is almost identical to wasm::alien_thread_runner. Maybe we should unify them.	2024-12-23 23:37:02 +01:00
Michał Chojnowski	866326efe4	utils: add stream_compressor Adds utilities for "advanced" methods of compression with lz4 and zstd -- with streaming (a history buffer persisted across messages) and/or precomputed dictionaries. This patch is mostly just glue needed to use the underlying libraries with discontiguous input and output buffers, and for reusing the same compressor context objects across messages. It doesn't contain any innovations of its own. There is one "design decision" in the patch. The block format of LZ4 doesn't contain the length of the compressed blocks. At decompression time, that length must be delivered to the decompressor by a channel separate to the compressed block itself. In `lz4_cstream`, we deal with that by prepending a variable-length integer containing the compressed size to each compressed block. This is suboptimal for single-fragment messages, since the user of lz4_cstream is likely going to remember the length of the whole message anyway, which makes the length prepended to the block redundant. But a loss of 1 byte is probably acceptable for most uses.	2024-12-23 23:28:12 +01:00
Pavel Emelyanov	972ff80fad	test: Add scope-streaming test (for restore from backup) - create - a cluster with given topology - keyspace with tablets and given rf value - table with some data - backup - flush all nodes - kick backup API on every node - re-create keyspace and table - drop it first - create again with the same parameters and schema, but don't populate table with data - restore - collect nodes to contact and corresponding list of TOCs according to the preferred "scope" - ask selected nodes to restore, limiting its streaming scope and providing the specific list of sstables - check - select mutation fragments from all nodes for random keys - make sure that the number of non-empty responses equals the expected rf value Specific topologies, RFs and stream scopes used are: rf = 1, nodes = 3, racks = 1, dcs = 1, scope = node rf = 3, nodes = 5, racks = 1, dcs = 1, scope = node rf = 1, nodes = 4, racks = 2, dcs = 1, scope = rack rf = 3, nodes = 6, racks = 2, dcs = 1, scope = rack rf = 3, nodes = 6, racks = 3, dcs = 1, scope = rack rf = 2, nodes = 8, racks = 4, dcs = 2, scope = dc nodes and racks are evenly distributed in racks and dcs respectively in the last topo RF effectively becomes 4 (2 in each dc) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:28:05 +03:00
Pavel Emelyanov	a24dc02255	api: New "scope" API param to load-and-stream calls There are two of those -- the POST /storage_service/keyspace that loads and streams new sstables from /upload and POST /storage_service/restore that does the same, but gets sstables from object store. The new optional parameter allows users to tune the streaming phase behavior. The test/pylib client part is also updated here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:28:05 +03:00
Pavel Emelyanov	960041d4b4	sstables_loader: Propagate scope from API down Semi-mechanical change that adds newly introduced "scope" parameter to all the functions between API methods and the low-level streamer object. No real functional changes. API methods set it to "all" to keep existing behavior. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:28:05 +03:00
Pavel Emelyanov	e8201a7897	sstables_loader: Filter tablets based on scope Loading and streaming tablets has a pre-filtering loop that walks the tablet map and sorts sstables into three lists: - fully contained in one of map ranges - partially overlapping with the map - not intersecting with the map Sstables from the 3rd list are immediately dropped from the process and for the remaining two the core load-and-stream happens. This filtering deserves more care from the newly introduced scope. When a tablet replica set doesn't get in the scope, the whole entry can be disregarded, because load-and-stream will only do its "load" part anyway and all mutations from it will be ignored. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:28:05 +03:00
Pavel Emelyanov	93aed22cd5	streamer: Disable scoped streaming of primary replica only There's been some discussion of how primary replica only streaming should interact with the scope. There are two options how to consider this combination: - find where the primary replica is and handle it if it's within the requested scope - within the requested scope find the primary replica for that subset of nodes, then handle it There's also an intermediate solution: support "primary replica in DC" and reject all other combinations. Until decided which way is correct, let's disable this configuration. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:19:17 +03:00
Pavel Emelyanov	30aac0d1da	sstables_loader: Introduce streaming scope Currently load-and-stream sends mutations to whatever node is considered to be a "replica" for it. One exception is the "primary-replica-only" flag that can be requested by the user. This patch introduces a "scope" parameter that limits where the streaming part can send the data to, with 4 options: - all -- current way of doing things, stream to wherever needed - dc -- only stream to nodes that live in the same datacenter - rack -- only stream to nodes that live in the same rack - node -- only "stream" to current node It's not yet configurable and streamer object initializes itself with "all" mode. Will be changed later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:19:17 +03:00
Pavel Emelyanov	7c1eaa427e	sstables_loader: Wrap get_endpoints() Preparatory patch. Next will add more code to get_endpoints() that will need to work for both if/else branches, this change helps having less churn later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-23 19:19:17 +03:00
Benny Halevy	68b0b442fd	locator: refactor sort_by_proximity Extract can_sort_by_proximity() out so it can be used later by storage_proxy, and introduce do_sort_by_proximity that sorts unconditionally. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-23 16:42:55 +02:00
Amnon Heiman	48f7ef1c30	alternator/executor.cc: Add WCU for update_item This patch adds WCU support for update_item. The way Alternator modifies values means we don't always have the full item sizes. When there is a read-before-write, the code in rmw_operation takes care of the object size. When updating a value without read-before-write, we will make a rough estimation of the value's size. This is better than simply taking 1 (as we do with delete) and is also more Alternator-like.	2024-12-20 14:55:55 +02:00
Aleksandra Martyniuk	da7301679b	test: truncate the table before node ops task checks Truncate a table before testing node ops tasks to check if the truncate request won't be considered by node_ops_virtual_task.	2024-12-20 12:26:42 +01:00
Aleksandra Martyniuk	ee4bd287fd	node_ops: rename a method that gets node ops entries	2024-12-20 12:25:48 +01:00
Aleksandra Martyniuk	a7fc566c7e	node_ops: filter topology_requests entries Currently node_ops_virtual_task shows stats of all system.topology_request entries. However, the table also contains info about non-node_ops requests, e.g. truncate. Filter the entries used by node_ops_virtual_task by their type. With this change bootstrap of the first node will not be visible. Update the test accordingly.	2024-12-20 12:20:42 +01:00
Dawid Mędrek	461a6b129c	docs: Update documentation on CREATE ROLE WITH HASHED PASSWORD As part of #18750, we added a CQL statement CREATE ROLE WITH SALTED HASH that prevented hashing a password when creating a role, effectively leading to inserting a hash given by the user directly into the database. In #21350, we noticed that Cassandra had implemented a CQL statement of similar semantics but different syntax. We decided to rename Scylla's statement to be compatible with Cassandra. Unfortunately, we didn't notice one more difference between what we had in Scylla and what was part of Cassandra. Scylla's statement was originally supposed to only be used when restoring the schema and the user needn't be aware of its existence at all: the database produced a sequence of CQL statements that the user saved to a file and when a need to restore the schema arose, they would execute the contents of the file. That's why, although we documented the feature, it was only done in the necessary places. Those that weren't related to the backup & restore procedure were deliberately skipped. Cassandra, on the other hand, added the statement for a different purpose (for details, see the relevant issue) and it was supposed to be used by the user by design. The statement is also documented as such. Since we want to preserve compatibility with Cassandra, we document the statement and its semantics in the user documentation, explicitly implying that it can be used by the user. Fixes scylladb/scylladb#21691	2024-12-17 13:43:36 +01:00
Dawid Mędrek	e365653560	test/boost: Add test for creating roles with hashed passwords We add a new test verifying that after creating a role with a hashed password using one of the supported encryption algorithms: bcrypt, sha256, sha512, or md5, the user can successfully log in.	2024-12-17 13:42:15 +01:00
Tomasz Grabiec	e732ff7cd8	tablets: load_balancer: Fail when draining with no candidate nodes If we're draining the last node in a DC, we won't have a chance to evaluate candidates and notice that constraints cannot be satisfied (N < RF). Draining will succeed and node will be removed with replicas still present on that node. This will cause later draining in the same DC to fail when we will have 2 replicas which need relocation for a given tablet. The expected behavior is for draining to fail, because we cannot keep the RF in the DC. This is consistent, for example, with what happens when removing a node in a 2-node cluster with RF=2. Fixes #21826	2024-12-17 12:14:18 +01:00
Tomasz Grabiec	8718450172	tablets: load_balancer: Ignore skip_list when draining When doing normal load balancing, we can ignore DOWN nodes in the node set and just balance the UP nodes among themselves because it's ok to equalize load just in that set, it improves the situation. It's dangerous to do that when draining because that can lead to overloading of the UP nodes. In the worst case, we can have only one non-drained node in the UP set, which would receive all the tablets of the drained node, doubling its load. It's safer to let the drain fail or stall. This is decided by topology coordinator, currently we will fail (on barrier) and rollback.	2024-12-17 12:14:18 +01:00
Tomasz Grabiec	2de3c079b2	tablets: topology_coordinator: Keep tablet_draining transition if nodes are not drained Empty plan with nodes to drain meant that we can exit tablet_draining transition and move to the next stage of decommission/removenode. If the tablet scheduler creates an empty plan for some reason but there are still undrained tablets, that could put topology in an invalid state. For example, this can currently happen if there are no non-draining nodes in a DC. This patch adds a safety net in the topology coordinator which prevents moving forward with undrained tablets.	2024-12-16 16:54:59 +01:00
Benny Halevy	8832301fe0	storage_service: replicate_to_all_cores: clear_gently pending erms In case the update is rolled back on error, call clear_gently for table_erms and view_erms to prevent potential stalls with a large number of tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-15 12:17:28 +02:00
Benny Halevy	500ca17370	test_mv_topology_change: drop delay_after_erm_update injection case After last patch, we deliberately don't yield between update of base table erm and updating its view, which was the scenario tested with the `delay_after_erm_update` error injection point. Instead, call maybe_yield in between base/views updates to prevent reactor stalls with many tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-15 12:11:43 +02:00
Benny Halevy	4bfa3060d0	storage_service: replicate_to_all_cores: update base and view tables atomically Currently, the loop updating all tables (including views) with the new effective_replication_map may yield, and therefore expose a state where the base and view tables effective_replication_map and topology are out of sync (as seen in scylladb/scylladb#17786) To prevent that, loop over all base tables and for each table update the base table and all views atomically, without yielding, and so allow yielding only between base tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-15 12:11:41 +02:00
Benny Halevy	10c4cf930c	table: make update_effective_replication_map sync again Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`). Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, which caused scylladb/scylladb#17786 This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a member and storage_group_manager::stop() consumes this future during table::stop(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-15 11:45:08 +02:00
Pavel Emelyanov	3081ce24cd	nodetool: Implement [gs]etstreamthroughput commands They exist in the original documentation, but are not yet implemented. Now it's possible to do it. It is slightly more complex than its compaction counterpart in the sense that the get method reports megabits/s by default and has an option to convert to MiB/s. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 14:39:47 +03:00
Pavel Emelyanov	67089fd5a1	nodetool: Implement [gs]etcompactionthroughput commands They exist in the original documentation, but are not yet implemented. Now it's possible to do it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 14:39:47 +03:00
Pavel Emelyanov	eb29d6f4b0	test: Add validation of how IO-updating endpoints work There are now four of those and these are all the same in the way they interpret the value parameter (though it's named differently) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 13:02:44 +03:00
Pavel Emelyanov	fa1ad5ecfd	api: Implement /storage_service/(stream\|compaction)_throughput endpoints Both values are in fact db::config named values. They are observed by, respectively, compaction manager and stream manager: when changed, the observer kicks corresponding sched group's update_io_bandwidth() method. Despite being referenced by managers, there's no way to update those values anyhow other than updating config's named values themselves. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 11:51:52 +03:00
Pavel Emelyanov	6659ceca4f	api: Disqualify const config reference Some endpoints in config block will need to actually _update_ values on config (see next patches why), and const reference stands on the way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 11:51:52 +03:00
Pavel Emelyanov	f3775ba957	api: Implement /storage_service/stream_throughput endpoint The value can be obtained from the stream_manager Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 11:51:52 +03:00
Pavel Emelyanov	b8bd170212	api: Move stream throughput set/get endpoints from storage service block In order to get stream throughput, the API will need stream_manager. In order to set stream throughput, the API will need db::config to update the corresponding named value on it. Said that, move the endpoints to relevant blocks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 11:51:52 +03:00
Pavel Emelyanov	d2c9c2abe8	api: Move set_compaction_throughput_mb_per_sec to config block In order to update compaction throughput API would need to update the db::config value, so the endpoint in question should sit in the block that has db::config at hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 11:51:52 +03:00
Pavel Emelyanov	7d6f8d728b	util: Include fmt/ranges.h in config_file.hh The operator() of named_value() prints the allowed values on error which can be a vector, so the ranges formatting should be there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-12-13 11:51:52 +03:00

2564 changed files with 202431 additions and 52176 deletions

1

.gitattributes vendored


@@ -2,3 +2,4 @@
 *.hh diff=cpp
 *.svg binary
 docs/_static/api/js/* binary
 pgo/profiles/** filter=lfs diff=lfs merge=lfs -text

14

.github/CODEOWNERS vendored


@@ -1,5 +1,5 @@
 # AUTH
 auth/* @nuivall @ptrsmrn @KrzaQ
 auth/* @nuivall @ptrsmrn
 # CACHE
 row_cache* @tgrabiec
@@ -25,15 +25,15 @@ compaction/* @raphaelsc
 transport/*
 # CQL QUERY LANGUAGE
 cql3/* @tgrabiec @nuivall @ptrsmrn @KrzaQ
 cql3/* @tgrabiec @nuivall @ptrsmrn
 # COUNTERS
 counters* @nuivall @ptrsmrn @KrzaQ
 tests/counter_test* @nuivall @ptrsmrn @KrzaQ
 counters* @nuivall @ptrsmrn
 tests/counter_test* @nuivall @ptrsmrn
 # DOCS
 docs/* @annastuchlik @tzach
 docs/alternator @annastuchlik @tzach @nyh @nuivall @ptrsmrn @KrzaQ
 docs/alternator @annastuchlik @tzach @nyh
 # GOSSIP
 gms/* @tgrabiec @asias @kbr-scylla
@@ -74,8 +74,8 @@ streaming/* @tgrabiec @asias
 service/storage_service.* @tgrabiec @asias
 # ALTERNATOR
 alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
 test/alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
 alternator/* @nyh
 test/alternator/* @nyh
 # HINTED HANDOFF
 db/hints/* @piodul @vladzcloudius @eliransin

									
										97

.github/ISSUE_TEMPLATE/bug_report.yml
									
										vendored
									
												
				@@ -1,15 +1,86 @@

				This is Scylla's bug tracker, to be used for reporting bugs only.

				name: "Report a bug"

				description: "File a bug report."

				title: "[Bug]: "

				type: "bug"

				labels: bug

				body:

				  - type: checkboxes

				    id: terms

				    attributes:

				      label: Code of Conduct

				      description: "This is Scylla's bug tracker, to be used for reporting bugs only.

				If you have a question about Scylla, and not a bug, please ask it in

				our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

				our forum at https://forum.scylladb.com/ or in our slack channel https://slack.scylladb.com/ "

				      options:

				        - label: I have read the disclaimer above and am reporting a suspected malfunction in Scylla.

				          required: true

				- [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

				*Installation details*

				Scylla version (or git commit hash):

				Cluster size:

				OS (RHEL/CentOS/Ubuntu/AWS AMI):

				*Hardware details (for performance issues)*          Delete if unneeded

				Platform (physical/VM/cloud instance type/docker):

				Hardware: sockets= cores= hyperthreading= memory=

				Disks: (SSD/HDD, count)

				  - type: input

				    id: product-version

				    attributes:

				      label: product version

				      description: Scylla version (or git commit hash)

				      placeholder: ex. scylla-6.1.1

				    validations:

				      required: true

				  - type: input

				    id: cluster-size

				    attributes:

				      label: Cluster Size

				    validations:

				      required: true  

				  - type: input

				    id: os

				    attributes:

				      label: OS

				      placeholder: RHEL/CentOS/Ubuntu/AWS AMI

				    validations:

				      required: true

				  - type: textarea

				    id: additional-data

				    attributes:

				      label: Additional Environmental Data

				      #description: 

				      placeholder: Add additional data

				      value: "Platform (physical/VM/cloud instance type/docker):\n

				Hardware: sockets=   cores=   hyperthreading=   memory=\n

				Disks: (SSD/HDD, count)"

				    validations:

				      required: false

				  - type: textarea

				    id: reproducer-steps

				    attributes:

				      label: Reproduction Steps

				      placeholder: Describe how to reproduce the problem

				      value: "The steps to reproduce the problem are:"

				    validations:

				      required: true

				  - type: textarea

				    id: the-problem

				    attributes:

				      label: What is the problem?

				      placeholder: Describe the problem you found

				      value: "The problem is that"

				    validations:

				      required: true

				  - type: textarea

				    id: what-happened

				    attributes:

				      label: Expected behavior?

				      placeholder: Describe what should have happened

				      value: "I expected that "

				    validations:

				      required: true

				  - type: textarea

				    id: logs

				    attributes:

				      label: Relevant log output

				      description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.

				      render: shell

									
										86

.github/copilot-instructions.md
									
										vendored
									
										Normal file
									
												
				@@ -0,0 +1,86 @@

				# ScyllaDB Development Instructions

				## Project Context

				High-performance distributed NoSQL database. Core values: performance, correctness, readability.

				## Build System

				### Modern Build (configure.py + ninja)

				```bash

				# Configure (run once per mode, or when switching modes)

				./configure.py --mode=<mode>  # mode: dev, debug, release, sanitize

				# Build everything

				ninja <mode>-build  # e.g., ninja dev-build

				# Build Scylla binary only (sufficient for Python integration tests)

				ninja build/<mode>/scylla

				# Build specific test

				ninja build/<mode>/test/boost/<test_name>

				```

				## Running Tests

				### C++ Unit Tests

				```bash

				# Run all tests in a file

				./test.py --mode=<mode> test/<suite>/<test_name>.cc

				# Run a single test case from a file

				./test.py --mode=<mode> test/<suite>/<test_name>.cc::<test_case_name>

				# Examples

				./test.py --mode=dev test/boost/memtable_test.cc

				./test.py --mode=dev test/raft/raft_server_test.cc::test_check_abort_on_client_api

				```

				**Important:** 

				- Use full path with `.cc` extension (e.g., `test/boost/test_name.cc`, not `boost/test_name`)

				- To run a single test case, append `::<test_case_name>` to the file path

				- If you encounter permission issues with cgroup metric gathering, add `--no-gather-metrics` flag

				**Rebuilding Tests:**

				- test.py does NOT automatically rebuild when test source files are modified

				- Many tests are part of composite binaries (e.g., `combined_tests` in test/boost contains multiple test files)

				- To find which binary contains a test, check `configure.py` in the repository root (primary source) or `test/<suite>/CMakeLists.txt`

				- To rebuild a specific test binary: `ninja build/<mode>/test/<suite>/<binary_name>`

				- Examples: 

				  - `ninja build/dev/test/boost/combined_tests` (contains group0_voter_calculator_test.cc and others)

				  - `ninja build/dev/test/raft/replication_test` (standalone Raft test)

				### Python Integration Tests

				```bash

				# Only requires Scylla binary (full build usually not needed)

				ninja build/<mode>/scylla

				# Run all tests in a file

				./test.py --mode=<mode> <test_path>

				# Run a single test case from a file

				./test.py --mode=<mode> <test_path>::<test_function_name>

				# Examples

				./test.py --mode=dev alternator/

				./test.py --mode=dev cluster/test_raft_voters::test_raft_limited_voters_retain_coordinator

				# Optional flags

				./test.py --mode=dev cluster/test_raft_no_quorum -v  # Verbose output

				./test.py --mode=dev cluster/test_raft_no_quorum --repeat 5  # Repeat test 5 times

				```

				**Important:**

				- Use path without `.py` extension (e.g., `cluster/test_raft_no_quorum`, not `cluster/test_raft_no_quorum.py`)

				- To run a single test case, append `::<test_function_name>` to the file path

				- Add `-v` for verbose output

				- Add `--repeat <num>` to repeat a test multiple times

				- After modifying C++ source files, only rebuild the Scylla binary for Python tests - building the entire repository is unnecessary
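The path conventions above differ between the two kinds of tests: C++ tests are addressed by the full path including `.cc`, Python tests by the path without `.py`, and both accept an optional `::<name>` suffix. The following is a minimal sketch of a helper that assembles such a command line. The function `build_test_command` and its parameters are hypothetical and not part of the repository; only the `--mode`, `-v` and `--repeat` flags are taken from the text above.

```python
from typing import List, Optional


def build_test_command(mode: str, test_path: str, case: Optional[str] = None,
                       repeat: Optional[int] = None, verbose: bool = False) -> List[str]:
    """Hypothetical helper: build an argv list for ./test.py."""
    if test_path.endswith(".py"):
        # Python tests are addressed without the .py extension.
        test_path = test_path[:-len(".py")]
    # C++ tests keep their .cc extension, so the path is left untouched.
    target = f"{test_path}::{case}" if case else test_path
    argv = ["./test.py", f"--mode={mode}", target]
    if verbose:
        argv.append("-v")
    if repeat is not None:
        argv += ["--repeat", str(repeat)]
    return argv
```

For instance, `build_test_command("dev", "test/boost/memtable_test.cc")` yields the argument list for the first C++ example above, and passing `cluster/test_raft_no_quorum.py` with `repeat=5` drops the extension before appending `--repeat 5`.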

				## Code Philosophy

				- Performance matters in hot paths (data read/write, inner loops)

				- Self-documenting code through clear naming

				- Comments explain "why", not "what"

				- Prefer standard library over custom implementations

				- Strive for simplicity and clarity, add complexity only when clearly justified

				- Question requests: don't blindly implement requests - evaluate trade-offs, identify issues, and suggest better alternatives when appropriate

				- Consider different approaches, weigh pros and cons, and recommend the best fit for the specific context

									
										115

.github/instructions/cpp.instructions.md
									
										vendored
									
										Normal file
									
												
				@@ -0,0 +1,115 @@

				---

				applyTo: "**/*.{cc,hh}"

				---

				# C++ Guidelines

				**Important:** Always match the style and conventions of existing code in the file and directory.

				## Memory Management

				- Prefer stack allocation whenever possible

				- Use `std::unique_ptr` by default for dynamic allocations

				- `new`/`delete` are forbidden (use RAII)

				- Use `seastar::lw_shared_ptr` or `seastar::shared_ptr` for shared ownership within same shard

				- Use `seastar::foreign_ptr` for cross-shard sharing

				- Avoid `std::shared_ptr` except when interfacing with external C++ APIs

				- Avoid raw pointers except for non-owning references or C API interop

				## Seastar Asynchronous Programming

				- Use `seastar::future<T>` for all async operations

				- Prefer coroutines (`co_await`, `co_return`) over `.then()` chains for readability

				- Coroutines are preferred over `seastar::do_with()` for managing temporary state

				- In hot paths where futures are ready, continuations may be more efficient than coroutines

				- Chain futures with `.then()`, don't block with `.get()` (unless in `seastar::thread` context)

				- All I/O must be asynchronous (no blocking calls)

				- Use `seastar::gate` for shutdown coordination

				- Use `seastar::semaphore` for resource limiting (not `std::mutex`)

				- Break long loops with `maybe_yield()` to avoid reactor stalls

				## Coroutines

				```cpp

				seastar::future<T> func() {

				    auto result = co_await async_operation();

				    co_return result;

				}

				```

				## Error Handling

				- Throw exceptions for errors (futures propagate them automatically)

				- In data path: avoid exceptions, use `std::expected` (or `boost::outcome`) instead

				- Use standard exceptions (`std::runtime_error`, `std::invalid_argument`)

				- Database-specific: throw appropriate schema/query exceptions

				## Performance

				- Pass large objects by `const&` or `&&` (move semantics)

				- Use `std::string_view` for non-owning string references

				- Avoid copies: prefer move semantics

				- Use `utils::chunked_vector` instead of `std::vector` for large allocations (>128KB)

				- Minimize dynamic allocations in hot paths

				## Database-Specific Types

				- Use `schema_ptr` for schema references

				- Use `mutation` and `mutation_partition` for data modifications

				- Use `partition_key` and `clustering_key` for keys

				- Use `api::timestamp_type` for database timestamps

				- Use `gc_clock` for garbage collection timing

				## Style

				- C++23 standard (prefer modern features, especially coroutines)

				- Use `auto` when type is obvious from RHS

				- Avoid `auto` when it obscures the type

				- Use range-based for loops: `for (const auto& item : container)`

				- Use standard algorithms when they clearly simplify code (e.g., replacing 10-line loops)

				- Avoid chaining multiple algorithms if a straightforward loop is clearer

				- Mark functions and variables `const` whenever possible

				- Use scoped enums: `enum class` (not unscoped `enum`)

				## Headers

				- Use `#pragma once`

				- Include order: own header, C++ std, Seastar, Boost, project headers

				- Forward declare when possible

				- Never `using namespace` in headers (exception: `using namespace seastar` is globally available via `seastarx.hh`)

				## Documentation

				- Public APIs require clear documentation

				- Implementation details should be self-evident from code

				- Use `///` or Doxygen `/** */` for public documentation, `//` for implementation notes - follow the existing style

				## Naming

				- `snake_case` for most identifiers (classes, functions, variables, namespaces)

				- Template parameters: `CamelCase` (e.g., `template<typename ValueType>`)

				- Member variables: prefix with `_` (e.g., `int _count;`)

				- Structs (value-only): no `_` prefix on members

				- Constants and `constexpr`: `snake_case` (e.g., `static constexpr int max_size = 100;`)

				- Files: `.hh` for headers, `.cc` for source

				## Formatting

				- 4 spaces indentation, never tabs

				- Opening braces on same line as control structure (except namespaces)

				- Space after keywords: `if (`, `while (`, `return `

				- Whitespace around operators matches precedence: `*a + *b` not `* a+* b`

				- Line length: keep reasonable (<160 chars), use continuation lines with double indent if needed

				- Brace all nested scopes, even single statements

				- Minimal patches: only format code you modify, never reformat entire files

				## Logging

				- Use structured logging with appropriate levels: DEBUG, INFO, WARN, ERROR

				- Include context in log messages (e.g., request IDs)

				- Never log sensitive data (credentials, PII)

				## Forbidden

				- `malloc`/`free`

				- `printf` family (use logging or fmt)

				- Raw pointers for ownership

				- `using namespace` in headers

				- Blocking operations: `std::sleep`, `std::read`, `std::mutex` (use Seastar equivalents)

				- `std::atomic` (reserved for very special circumstances only)

				- Macros (use `inline`, `constexpr`, or templates instead)

				## Testing

				When modifying existing code, follow TDD: create/update test first, then implement.

				- Examine existing tests for style and structure

				- Use Boost.Test framework

				- Use `SEASTAR_THREAD_TEST_CASE` for Seastar asynchronous tests

				- Aim for high code coverage, especially for new features and bug fixes

				- Maintain bisectability: all tests must pass in every commit. Mark failing tests with `BOOST_FAIL()` or similar, then fix in subsequent commit

									
										51

.github/instructions/python.instructions.md
									
										vendored
									
										Normal file
									
												
				@@ -0,0 +1,51 @@

				---

				applyTo: "**/*.py"

				---

				# Python Guidelines

				**Important:** Match existing code style. Some directories (like `test/cqlpy` and `test/alternator`) prefer simplicity over type hints and docstrings.

				## Style

				- Follow PEP 8

				- Use type hints for function signatures (unless directory style omits them)

				- Use f-strings for formatting

				- Line length: 160 characters max

				- 4 spaces for indentation

				## Imports

				Order: standard library, third-party, local imports

				```python

				import os

				import sys

				import pytest

				from cassandra.cluster import Cluster

				from test.utils import setup_keyspace

				```

				Never use `from module import *`

				## Documentation

				All public functions/classes need docstrings (unless the current directory conventions omit them):

				```python

				def my_function(arg1: str, arg2: int) -> bool:

				    """

				    Brief summary of function purpose.

				    Args:

				        arg1: Description of first argument.

				        arg2: Description of second argument.

				    Returns:

				        Description of return value.

				    """

				    pass

				```

				## Testing Best Practices

				- Maintain bisectability: all tests must pass in every commit

				- Mark currently-failing tests with `@pytest.mark.xfail`, unmark when fixed

				- Use descriptive names that convey intent

				- Docstrings/comments should explain what the test verifies and why, and if it reproduces a specific issue or how it fits into the larger test suite

									
										115

.github/scripts/auto-backport.py
									
										vendored
									
												
				@@ -29,10 +29,11 @@ def parse_args():

				    parser.add_argument('--commits', default=None, type=str, help='Range of promoted commits.')

				    parser.add_argument('--pull-request', type=int, help='Pull request number to be backported')

				    parser.add_argument('--head-commit', type=str, required=is_pull_request(), help='The HEAD of target branch after the pull request specified by --pull-request is merged')

				    parser.add_argument('--github-event', type=str, help='Get GitHub event type')

				    return parser.parse_args()

				def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr_title, commits, is_draft=False):

				def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr_title, commits, is_draft, is_collaborator):

				    pr_body = f'{pr.body}\n\n'

				    for commit in commits:

				        pr_body += f'- (cherry picked from commit {commit})\n\n'

				@@ -46,12 +47,29 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr

				            draft=is_draft

				        )

				        logging.info(f"Pull request created: {backport_pr.html_url}")

				        backport_pr.add_to_assignees(pr.user)

				        labels_to_add = []

				        priority_labels = {"P0", "P1"}

				        parent_pr_labels = [label.name for label in pr.labels]

				        for label in priority_labels:

				            if label in parent_pr_labels:

				                labels_to_add.append(label)

				                labels_to_add.append("force_on_cloud")

				                logging.info(f"Adding {label} and force_on_cloud labels from parent PR to backport PR")

				                break  # Only apply the highest priority label

				        if is_collaborator:

				            backport_pr.add_to_assignees(pr.user)

				        if is_draft:

				            backport_pr.add_to_labels("conflicts")

				            pr_comment = f"@{pr.user} - This PR was marked as draft because it has conflicts\n"

				            labels_to_add.append("conflicts")

				            pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"

				            pr_comment += "Please resolve them and mark this PR as ready for review"

				            backport_pr.create_issue_comment(pr_comment)

				        # Apply all labels at once if we have any

				        if labels_to_add:

				            backport_pr.add_to_labels(*labels_to_add)

				            logging.info(f"Added labels to backport PR: {labels_to_add}")

				        logging.info(f"Assigned PR to original author: {pr.user}")

				        return backport_pr

				    except GithubException as e:
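The label handling in the hunk above can be read in isolation as a pure function: copy at most one priority label from the parent PR, pair it with `force_on_cloud`, and add `conflicts` for draft PRs. The sketch below is an illustration only, not the script's code; `select_backport_labels` and `PRIORITY_LABELS` are hypothetical names. One deliberate difference: the hunk iterates over a set literal, whose iteration order for strings is not guaranteed between runs, so this sketch assumes an ordered tuple is wanted to make "highest priority first" deterministic.

```python
from typing import Iterable, List

# Ordered from highest to lowest priority; a tuple keeps iteration order fixed.
PRIORITY_LABELS = ("P0", "P1")


def select_backport_labels(parent_labels: Iterable[str], is_draft: bool) -> List[str]:
    """Hypothetical helper: compute the labels to put on a backport PR."""
    parent = set(parent_labels)
    labels: List[str] = []
    for label in PRIORITY_LABELS:
        if label in parent:
            # Only the highest-priority label present on the parent is copied.
            labels += [label, "force_on_cloud"]
            break
    if is_draft:
        # Draft backports are the ones whose cherry-pick hit conflicts.
        labels.append("conflicts")
    return labels
```

Keeping the selection free of GitHub API calls makes it easy to check without a live repository; the caller would then pass the result to a single `add_to_labels(*labels)` call, as the hunk does.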

				@@ -66,7 +84,8 @@ def get_pr_commits(repo, pr, stable_branch, start_commit=None):

				    if pr.merged:

				        merge_commit = repo.get_commit(pr.merge_commit_sha)

				        if len(merge_commit.parents) > 1:  # Check if this merge commit includes multiple commits

				            commits.append(pr.merge_commit_sha)

				            for commit in pr.get_commits():

				                commits.append(commit.sha)

				        else:

				            if start_commit:

				                promoted_commits = repo.compare(start_commit, stable_branch).commits

				@@ -91,18 +110,7 @@ def get_pr_commits(repo, pr, stable_branch, start_commit=None):

				    return commits

				def create_pr_comment_and_remove_label(pr, comment_body):

				    labels = pr.get_labels()

				    pattern = re.compile(r"backport/\d+\.\d+$")

				    for label in labels:

				        if pattern.match(label.name):

				            print(f"Removing label: {label.name}")

				            comment_body += f'- {label.name}\n'

				            pr.remove_from_labels(label)

				    pr.create_issue_comment(comment_body)

				def backport(repo, pr, version, commits, backport_base_branch):

				def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):

				    new_branch_name = f'backport/{pr.number}/to-{version}'

				    backport_pr_title = f'[Backport {version}] {pr.title}'

				    repo_url = f'https://scylladbbot:{github_token}@github.com/{repo.full_name}.git'

				@@ -114,33 +122,51 @@ def backport(repo, pr, version, commits, backport_base_branch):

				            is_draft = False

				            for commit in commits:

				                try:

				                    repo_local.git.cherry_pick(commit, '-m1', '-x')

				                    repo_local.git.cherry_pick(commit, '-x')

				                except GitCommandError as e:

				                    logging.warning(f'Cherry-pick conflict on commit {commit}: {e}')

				                    is_draft = True

				                    repo_local.git.add(A=True)

				                    repo_local.git.cherry_pick('--continue')

				            repo_local.git.push(fork_repo, new_branch_name, force=True)

				            create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,

				                                is_draft=is_draft)

				            # Check if the branch already exists in the remote fork

				            remote_refs = repo_local.git.ls_remote('--heads', fork_repo, new_branch_name)

				            if not remote_refs:

				                # Branch does not exist, create it with a regular push

				                repo_local.git.push(fork_repo, new_branch_name)

				                create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,

				                                    is_draft, is_collaborator)

				            else:

				                logging.info(f"Remote branch {new_branch_name} already exists in fork. Skipping push.")

				        except GitCommandError as e:

				            logging.warning(f"GitCommandError: {e}")

				def with_github_keyword_prefix(repo, pr):

				    pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"

				    match = re.findall(pattern, pr.body, re.IGNORECASE)

				    if not match:

				        print(f'No valid close reference for {pr.number}')

				        comment = f':warning:  @{pr.user.login} PR body does not contain a Fixes reference to an issue '

				        comment += ' and can not be backported\n\n'

				        comment += 'The following labels were removed:\n'

				        create_pr_comment_and_remove_label(pr, comment)

				        return False

				    else:

				    # GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs

				    github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"

				    # JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92

				    jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"

				    # Check PR body for GitHub issues

				    github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)

				    # Check PR body for JIRA issues

				    jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)

				    match = github_match or jira_match

				    if match:

				        return True

				    for commit in pr.get_commits():

				        github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)

				        jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)

				        if github_match or jira_match:

				            print(f'{pr.number} has a valid close reference in commit message {commit.sha}')

				            return True

				    print(f'No valid close reference for {pr.number}')

				    return False

				def main():

				    args = parse_args()

				@@ -161,6 +187,7 @@ def main():

				    scylladbbot_repo = g.get_repo(fork_repo_name)

				    closed_prs = []

				    start_commit = None

				    is_collaborator = True

				    if args.commits:

				        start_commit, end_commit = args.commits.split('..')

				@@ -185,21 +212,33 @@ def main():

				        if not backport_labels:

				            print(f'no backport label: {pr.number}')

				            continue

				        if args.commits and not with_github_keyword_prefix(repo, pr):

				        if not with_github_keyword_prefix(repo, pr) and args.github_event != 'unlabeled':

				            comment = f''':warning:  @{pr.user.login} PR body or PR commits do not contain a Fixes reference to an issue and can not be backported

				            please update PR body with a valid ref to an issue. Then remove `scylladbbot/backport_error` label to re-trigger the backport process

				            '''

				            pr.create_issue_comment(comment)

				            pr.add_to_labels("scylladbbot/backport_error")

				            continue

				        if not repo.private and not scylladbbot_repo.has_in_collaborators(pr.user.login):

				            logging.info(f"Sending an invite to {pr.user.login} to become a collaborator to {scylladbbot_repo.full_name} ")

				            scylladbbot_repo.add_to_collaborators(pr.user.login)

				            comment = f':warning:  @{pr.user.login} you have been added as collaborator to scylladbbot fork '

				            comment += f'Please check your inbox and approve the invitation, once it is done, please add the backport labels again\n'

				            create_pr_comment_and_remove_label(pr, comment)

				            continue

				            comment = f''':warning:  @{pr.user.login} you have been added as collaborator to scylladbbot fork

				            Please check your inbox and approve the invitation, otherwise you will not be able to edit PR branch when needed

				            '''

				            # When a pull request is pending for backport but its author is not yet a collaborator of "scylladbbot",

				            # we attach a "scylladbbot/backport_error" label to the PR.

				            # This prevents the workflow from proceeding with the backport process

				            # until the author has been granted proper permissions

				            # the author should remove the label manually to re-trigger the backport workflow.

				            pr.add_to_labels("scylladbbot/backport_error")

				            pr.create_issue_comment(comment)

				            is_collaborator = False

				        commits = get_pr_commits(repo, pr, stable_branch, start_commit)

				        logging.info(f"Found PR #{pr.number} with commit {commits} and the following labels: {backport_labels}")

				        for backport_label in backport_labels:

				            version = backport_label.replace('backport/', '')

				            backport_base_branch = backport_label.replace('backport/', backport_branch)

				            backport(repo, pr, version, commits, backport_base_branch)

				            backport(repo, pr, version, commits, backport_base_branch, is_collaborator)

				if __name__ == "__main__":

									
										81

.github/scripts/check-license.py
									
										Executable file
												
				@@ -0,0 +1,81 @@

				#!/usr/bin/env python3

				# -*- coding: utf-8 -*-

				#

				# Copyright (C) 2024-present ScyllaDB

				#

				#

				# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				#

				import argparse

				import sys

				from pathlib import Path

				from typing import Set

				def parse_args() -> argparse.Namespace:

				    """Parses command-line arguments."""

				    parser = argparse.ArgumentParser(description='Check license headers in files')

				    parser.add_argument('--files', required=True, nargs="+", type=Path,

				                        help='List of files to check')

				    parser.add_argument('--license', required=True,

				                        help='License to check for')

				    parser.add_argument('--check-lines', type=int, default=10,

				                        help='Number of lines to check (default: %(default)s)')

				    parser.add_argument('--extensions', required=True, nargs="+",

				                        help='List of file extensions to check')

				    parser.add_argument('--verbose', action='store_true',

				                        help='Print verbose output (default: %(default)s)')

				    return parser.parse_args()

				def should_check_file(file_path: Path, allowed_extensions: Set[str]) -> bool:

				    return file_path.suffix in allowed_extensions

				def check_license_header(file_path: Path, license_header: str, check_lines: int) -> bool:

				    try:

				        with open(file_path, 'r', encoding='utf-8') as f:

				            for _ in range(check_lines):

				                line = f.readline()

				                if license_header in line:

				                    return True

				        return False

				    except (UnicodeDecodeError, StopIteration):

				        # Handle files that can't be read as text or have fewer lines

				        return False

				def main() -> int:

				    args = parse_args()

				    if not args.files:

				        print("No files to check")

				        return 0

				    num_errors = 0

				    for file_path in args.files:

				        # Skip non-existent files

				        if not file_path.exists():

				            continue

				        # Skip files with non-matching extensions

				        if not should_check_file(file_path, args.extensions):

				            print(f"ℹ️ Skipping file with unchecked extension: {file_path}")

				            continue

				        # Check license header

				        if check_license_header(file_path, args.license, args.check_lines):

				            if args.verbose:

				                print(f"✅ License header found in: {file_path}")

				        else:

				            print(f"❌ Missing license header in: {file_path}")

				            num_errors += 1

				    if num_errors > 0:

				        sys.exit(1)

				if __name__ == '__main__':

				    main()
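
For reference, the header check in this file can be exercised in isolation. The following is a minimal sketch, not the script itself: `has_license_header` is a hypothetical helper name, and it uses `itertools.islice` to read the first lines instead of the `readline()` loop used in `check_license_header`. The temporary file names are illustrative only.

```python
import itertools
import tempfile
from pathlib import Path


def has_license_header(path: Path, license_id: str, check_lines: int = 10) -> bool:
    """Return True if license_id appears within the first check_lines lines."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return any(license_id in line for line in itertools.islice(f, check_lines))
    except UnicodeDecodeError:
        # Files that cannot be decoded as UTF-8 are treated as lacking the header.
        return False


LICENSE = "LicenseRef-ScyllaDB-Source-Available-1.0"

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    good.write_text(
        f"#!/usr/bin/env python3\n# SPDX-License-Identifier: {LICENSE}\n",
        encoding="utf-8",
    )
    # Header placed on line 21, outside the default 10-line window.
    late = Path(tmp) / "late.py"
    late.write_text("\n" * 20 + f"# SPDX-License-Identifier: {LICENSE}\n", encoding="utf-8")
    results = {
        "good": has_license_header(good, LICENSE),
        "late": has_license_header(late, LICENSE),
    }
    print(results)
```

The second file shows why the `--check-lines` window matters: a header that appears too far down the file is reported as missing.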

									
										20

.github/scripts/sync_labels.py
									
												
				@@ -30,8 +30,13 @@ def copy_labels_from_linked_issues(repo, pr_number):

				            try:

				                issue = repo.get_issue(int(issue_number))

				                for label in issue.labels:

				                    # Copy ALL labels from issues to PR when PR is opened

				                    pr.add_to_labels(label.name)

				                print(f"Labels from issue #{issue_number} copied to PR #{pr_number}")

				                    print(f"Copied label '{label.name}' from issue #{issue_number} to PR #{pr_number}")

				                    if label.name in ['P0', 'P1']:

				                        pr.add_to_labels('force_on_cloud')

				                        print(f"Added force_on_cloud label to PR #{pr_number} due to {label.name} label")

				                print(f"All labels from issue #{issue_number} copied to PR #{pr_number}")

				            except Exception as e:

				                print(f"Error processing issue #{issue_number}: {e}")

				@@ -74,9 +79,22 @@ def sync_labels(repo, number, label, action, is_issue=False):

				            target = repo.get_issue(int(pr_or_issue_number))

				        if action == 'labeled':

				            target.add_to_labels(label)

				            if label in ['P0', 'P1'] and is_issue:

				                # Only add force_on_cloud to PRs when P0/P1 is added to an issue

				                target.add_to_labels('force_on_cloud')

				                print(f"Added 'force_on_cloud' label to PR #{pr_or_issue_number} due to {label} label")

				            print(f"Label '{label}' successfully added.")

				        elif action == 'unlabeled':

				            target.remove_from_labels(label)

				            if label in ['P0', 'P1'] and is_issue:

				                # Check if any other P0/P1 labels remain before removing force_on_cloud

				                remaining_priority_labels = [l.name for l in target.labels if l.name in ['P0', 'P1']]

				                if not remaining_priority_labels:

				                    try:

				                        target.remove_from_labels('force_on_cloud')

				                        print(f"Removed 'force_on_cloud' label from PR #{pr_or_issue_number} as no P0/P1 labels remain")

				                    except Exception as e:

				                        print(f"Warning: Could not remove force_on_cloud label: {e}")

				            print(f"Label '{label}' successfully removed.")

				        elif action == 'opened':

				            copy_labels_from_linked_issues(repo, number)
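
The `force_on_cloud` handling in `sync_labels` can be modelled on a plain set of label names, which makes the rule easy to check without the GitHub API. This is a sketch under assumptions: `labels_after_event` is a hypothetical function, not part of the script, and it covers only the `labeled` and `unlabeled` branches.

```python
from typing import Set

PRIORITY_LABELS = {"P0", "P1"}
FORCE_LABEL = "force_on_cloud"


def labels_after_event(current: Set[str], label: str, action: str, is_issue: bool) -> Set[str]:
    """Return the target's label set after a labeled/unlabeled event."""
    result = set(current)
    if action == "labeled":
        result.add(label)
        # force_on_cloud is only propagated when the event comes from an issue.
        if label in PRIORITY_LABELS and is_issue:
            result.add(FORCE_LABEL)
    elif action == "unlabeled":
        result.discard(label)
        # Drop force_on_cloud only when no P0/P1 label remains.
        if label in PRIORITY_LABELS and is_issue and not (result & PRIORITY_LABELS):
            result.discard(FORCE_LABEL)
    return result


added = labels_after_event({"bug"}, "P0", "labeled", is_issue=True)
kept = labels_after_event({"P0", "P1", "force_on_cloud"}, "P0", "unlabeled", is_issue=True)
dropped = labels_after_event({"P1", "force_on_cloud"}, "P1", "unlabeled", is_issue=True)
print(sorted(added), sorted(kept), sorted(dropped))
```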

									
										16

.github/seastar-bad-include.json
									
										Normal file
												
				@@ -0,0 +1,16 @@

				{

				    "problemMatcher": [

				        {

				            "owner": "seastar-bad-include",

				            "severity": "error",

				            "pattern": [

				                {

				                    "regexp": "^(.+):(\\d+):(.+)$",

				                    "file": 1,

				                    "line": 2,

				                    "message": 3

				                }

				            ]

				        }

				    ]

				}
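
To see what the matcher's `regexp` captures, the same pattern can be applied to one line of `git grep -n` output. This is a sketch only: the sample line is made up, and Python's `re` is used here in place of the JavaScript regex engine GitHub runs, which should behave the same for this simple pattern.

```python
import re

# Same pattern as in the JSON above, after JSON unescaping ("\\d" becomes "\d").
PATTERN = re.compile(r"^(.+):(\d+):(.+)$")

# Hypothetical line in the "<file>:<line>:<text>" shape produced by `git grep -n`.
sample = 'db/config.cc:42:#include "seastar/core/future.hh"'

match = PATTERN.match(sample)
file, line, message = match.group(1), int(match.group(2)), match.group(3)
print(file, line, message)
```

Groups 1, 2 and 3 correspond to the `file`, `line` and `message` indices declared in the matcher.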

									
										34

.github/workflows/add-label-when-promoted.yaml
									
												
				@@ -7,7 +7,7 @@ on:

				      - branch-*.*

				      - enterprise

				  pull_request_target:

				    types: [labeled]

				    types: [labeled, unlabeled]

				    branches: [master, next, enterprise]

				jobs:

				@@ -53,19 +53,31 @@ jobs:

				        env:

				          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				        run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --commits ${{ github.event.before }}..${{ github.sha }}

				      - name: Check if label starts with 'backport/' and contains digits

				      - name: Check if a valid backport label exists and no backport_error

				        env:

				          LABELS_JSON: ${{ toJson(github.event.pull_request.labels) }}

				        id: check_label

				        run: |

				          label_name="${{ github.event.label.name }}"

				          if [[ "$label_name" =~ ^backport/[0-9]+\.[0-9]+$ ]]; then

				            echo "Label matches backport/X.X pattern."

				            echo "backport_label=true" >> $GITHUB_OUTPUT

				          labels_json="$LABELS_JSON"

				          echo "Checking labels:"

				          echo "$labels_json" | jq -r '.[].name'

				          # Check if a valid backport label exists

				          if echo "$labels_json" | jq -e 'any(.[] | .name; test("backport/[0-9]+\\.[0-9]+$"))' > /dev/null; then

				            # Ensure scylladbbot/backport_error is NOT present

				            if ! echo "$labels_json" | jq -e '.[] | select(.name == "scylladbbot/backport_error")' > /dev/null; then

				              echo "A matching backport label was found and no backport_error label exists."

				              echo "ready_for_backport=true" >> "$GITHUB_OUTPUT"

				              exit 0

				            else

				              echo "The label 'scylladbbot/backport_error' is present, invalidating backport."

				            fi

				          else

				            echo "Label does not match the required pattern."

				            echo "backport_label=false" >> $GITHUB_OUTPUT

				            echo "No matching backport label found."

				          fi

				      - name: Run auto-backport.py when label was added

				        if: ${{ github.event_name == 'pull_request_target' && steps.check_label.outputs.backport_label == 'true' && github.event.pull_request.state == 'closed' }}

				          echo "ready_for_backport=false" >> "$GITHUB_OUTPUT"

				      - name: Run auto-backport.py when PR is closed

				        if: ${{ github.event_name == 'pull_request_target' && steps.check_label.outputs.ready_for_backport == 'true' && github.event.pull_request.state == 'closed' }}

				        env:

				          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				        run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --pull-request ${{ github.event.pull_request.number }} --head-commit ${{ github.event.pull_request.base.sha }}

				        run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --pull-request ${{ github.event.pull_request.number }} --head-commit ${{ github.event.pull_request.base.sha }} --github-event ${{ github.event.action }}
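
The new `check_label` step reduces to a single predicate over the PR's label list: at least one `backport/X.Y` label and no `scylladbbot/backport_error`. A rough Python equivalent of the two `jq` filters, assuming the `LABELS_JSON` payload is a list of objects with a `name` field (`ready_for_backport` is a hypothetical name):

```python
import json
import re

# jq's test() searches anywhere in the string, so re.search is the closest match.
BACKPORT_RE = re.compile(r"backport/[0-9]+\.[0-9]+$")
ERROR_LABEL = "scylladbbot/backport_error"


def ready_for_backport(labels_json: str) -> bool:
    """True if a backport/X.Y label exists and the error label is absent."""
    names = [label["name"] for label in json.loads(labels_json)]
    has_backport = any(BACKPORT_RE.search(name) for name in names)
    return has_backport and ERROR_LABEL not in names


ok = ready_for_backport('[{"name": "backport/2025.1"}]')
blocked = ready_for_backport(
    '[{"name": "backport/2025.1"}, {"name": "scylladbbot/backport_error"}]'
)
print(ok, blocked)
```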

									
										12

.github/workflows/call_jira_status_in_progress.yml
									
										Normal file
												
				@@ -0,0 +1,12 @@

				name: Call Jira Status In Progress

				on:

				  pull_request_target:

				    types: [opened]

				jobs:

				  call-jira-status-in-progress:

				    uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_progress.yml@main

				    secrets:

				      caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										12

.github/workflows/call_jira_status_in_review.yml
									
										Normal file
												
				@@ -0,0 +1,12 @@

				name: Call Jira Status In Review

				on:

				  pull_request_target:

				    types: [ready_for_review, review_requested]

				jobs:

				  call-jira-status-in-review:

				    uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_review.yml@main

				    secrets:

				      caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										12

.github/workflows/call_jira_status_ready_for_merge.yml
									
										Normal file
												
				@@ -0,0 +1,12 @@

				name: Call Jira Status Ready For Merge

				on:

				  pull_request_target:

				    types: [labeled]

				jobs:

				  call-jira-status-update:

				    uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_ready_for_merge.yml@main

				    secrets:

				      caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										52

.github/workflows/check-license-header.yaml
									
										Normal file
												
				@@ -0,0 +1,52 @@

				name: License Header Check

				on:

				  pull_request:

				    types: [opened, synchronize, reopened]

				    branches: [master]

				env:

				  HEADER_CHECK_LINES: 10

				  LICENSE: "LicenseRef-ScyllaDB-Source-Available-1.0"

				  CHECKED_EXTENSIONS: ".cc .hh .py"

				jobs:

				  check-license-headers:

				    name: Check License Headers

				    runs-on: ubuntu-latest

				    permissions:

				      pull-requests: write

				    steps:

				      - name: Checkout code

				        uses: actions/checkout@v4

				        with:

				          fetch-depth: 0

				      - name: Get changed files

				        id: changed-files

				        run: |

				          # Get list of added files comparing with base branch

				          echo "files=$(git diff --name-only --diff-filter=A ${{ github.event.pull_request.base.sha }} ${{ github.sha }} | tr '\n' ' ')" >> $GITHUB_OUTPUT

				      - name: Check license headers

				        if: steps.changed-files.outputs.files != ''

				        run: |

				          .github/scripts/check-license.py \

				            --files ${{ steps.changed-files.outputs.files }} \

				            --license "${{ env.LICENSE }}" \

				            --check-lines "${{ env.HEADER_CHECK_LINES }}" \

				            --extensions ${{ env.CHECKED_EXTENSIONS }}

				      - name: Comment on PR if check fails

				        if: failure()

				        uses: actions/github-script@v7

				        with:

				          script: |

				            const license = '${{ env.LICENSE }}';

				            await github.rest.issues.createComment({

				              issue_number: context.issue.number,

				              owner: context.repo.owner,

				              repo: context.repo.repo,

				              body: `❌ License header check failed. Please ensure all new files include the header within the first ${{ env.HEADER_CHECK_LINES }} lines:\n\`\`\`\n${license}\n\`\`\`\nSee action logs for details.`

				            });

									
										2

.github/workflows/clang-nightly.yaml
									
												
				@@ -7,7 +7,7 @@ on:

				env:

				  # use the development branch explicitly

				  CLANG_VERSION: 20

				  CLANG_VERSION: 21

				  BUILD_DIR: build

				permissions: {}

									
										1

.github/workflows/clang-tidy.yaml
									
												
				@@ -34,6 +34,7 @@ jobs:

				    name: Run clang-tidy

				    needs:

				      - read-toolchain

				    if: "${{ needs.read-toolchain.result == 'success' }}"

				    runs-on: ubuntu-latest

				    container: ${{ needs.read-toolchain.outputs.image }}

				    steps:

									
										143

.github/workflows/conflict_reminder.yaml
									
												
				@@ -1,9 +1,16 @@

				name: Notify PR Authors of Conflicts

				permissions:

				  issues: write

				  pull-requests: write

				on:

				  push:

				    branches:

				      - 'master'

				      - 'branch-*'

				  schedule:

				    - cron: '0 10 * * 1,4'  # Runs every Monday and Thursday at 10:00am

				  workflow_dispatch:      # Manual trigger for testing

				    - cron: '0 10 * * 1'  # Runs every Monday at 10:00am

				jobs:

				  notify_conflict_prs:

				@@ -14,32 +21,134 @@ jobs:

				        uses: actions/github-script@v7

				        with:

				          script: |

				            console.log("Starting conflict reminder script...");

				            // Print trigger event

				            if (process.env.GITHUB_EVENT_NAME) {

				              console.log(`Workflow triggered by: ${process.env.GITHUB_EVENT_NAME}`);

				            } else {

				              console.log("Could not determine workflow trigger event.");

				            }

				            const isPushEvent = process.env.GITHUB_EVENT_NAME === 'push';

				            console.log(`isPushEvent: ${isPushEvent}`);

				            const twoMonthsAgo = new Date();

				            twoMonthsAgo.setMonth(twoMonthsAgo.getMonth() - 2);

				            const prs = await github.paginate(github.rest.pulls.list, {

				              owner: context.repo.owner,

				              repo: context.repo.repo,

				              state: 'open',

				              per_page: 100

				            });

				            console.log(`Fetched ${prs.length} open PRs`);

				            const recentPrs = prs.filter(pr => new Date(pr.created_at) >= twoMonthsAgo);

				            const validBaseBranches = ['master'];

				            const branchPrefix = 'branch-';

				            const threeDaysAgo = new Date();

				            const conflictLabel = 'conflicts';          

				            threeDaysAgo.setDate(threeDaysAgo.getDate() - 3);

				            for (const pr of prs) {

				              if (!pr.base.ref.startsWith(branchPrefix)) continue;

				              const hasConflictLabel = pr.labels.some(label => label.name === conflictLabel);

				              if (!hasConflictLabel) continue;

				            const oneWeekAgo = new Date();

				            const conflictLabel = 'conflicts';

				            oneWeekAgo.setDate(oneWeekAgo.getDate() - 7);

				            console.log(`One week ago: ${oneWeekAgo.toISOString()}`);

				            for (const pr of recentPrs) {

				              console.log(`Checking PR #${pr.number} on base branch '${pr.base.ref}'`);

				              const isBranchX = pr.base.ref.startsWith(branchPrefix);

				              const isMaster = validBaseBranches.includes(pr.base.ref);

				              if (!(isBranchX || isMaster)) {

				                console.log(`PR #${pr.number} skipped: base branch is not 'master' or does not start with '${branchPrefix}'`);

				                continue;

				              }

				              const updatedDate = new Date(pr.updated_at);

				              if (updatedDate >= threeDaysAgo) continue;

				              if (pr.assignee === null) continue;

				              const assignee = pr.assignee.login;

				              if (assignee) {

				                await github.rest.issues.createComment({

				              console.log(`PR #${pr.number} last updated at: ${updatedDate.toISOString()}`);

				              if (!isPushEvent && updatedDate >= oneWeekAgo) {

				                console.log(`PR #${pr.number} skipped: updated within last week`);

				                continue;

				              }

				              if (pr.assignee === null) {

				                console.log(`PR #${pr.number} skipped: no assignee`);

				                continue;

				              }

				              // Fetch PR details to check mergeability

				              let { data: prDetails } = await github.rest.pulls.get({

				                owner: context.repo.owner,

				                repo: context.repo.repo,

				                pull_number: pr.number,

				              });

				              console.log(`PR #${pr.number} mergeable: ${prDetails.mergeable}`);

				              // Wait and re-fetch if mergeable is null

				              if (prDetails.mergeable === null) {

				                console.log(`PR #${pr.number} mergeable is null, waiting 2 seconds and retrying...`);

				                await new Promise(resolve => setTimeout(resolve, 2000)); // wait 2 seconds

				                prDetails = (await github.rest.pulls.get({

				                  owner: context.repo.owner,

				                  repo: context.repo.repo,

				                  pull_number: pr.number,

				                })).data;

				                console.log(`PR #${pr.number} mergeable after retry: ${prDetails.mergeable}`);

				              }

				              if (prDetails.mergeable === false) {

				                const hasConflictLabel = pr.labels.some(label => label.name === conflictLabel);

				                console.log(`PR #${pr.number} has conflict label: ${hasConflictLabel}`);

				                // Fetch comments to check for existing notifications

				                const comments = await github.paginate(github.rest.issues.listComments, {

				                  owner: context.repo.owner,

				                  repo: context.repo.repo,

				                  issue_number: pr.number,

				                  body: `@${assignee}, this PR has been open with conflicts. Please resolve the conflicts so we can merge it.`,

				                  per_page: 100,

				                });

				                console.log(`Notified @${assignee} for PR #${pr.number}`);

				              } 

				                // Find last notification comment from the bot

				                const notificationPrefix = `@${pr.assignee.login}, this PR has merge conflicts with the base branch.`;

				                const lastNotification = comments

				                  .filter(c =>

				                    c.user.type === "Bot" &&

				                    c.body.startsWith(notificationPrefix)

				                  )

				                  .sort((a, b) => new Date(b.created_at) - new Date(a.created_at))[0];

				                // Check if we should skip notification based on recent notification

				                let shouldSkipNotification = false;

				                if (lastNotification) {

				                  const lastNotified = new Date(lastNotification.created_at);

				                  if (lastNotified >= oneWeekAgo) {

				                    console.log(`PR #${pr.number} skipped: last notification was less than 1 week ago`);

				                    shouldSkipNotification = true;

				                  }

				                }

				                // Additional check for push events on draft PRs with conflict labels

				                if (

				                  isPushEvent &&

				                  pr.draft === true &&

				                  hasConflictLabel &&

				                  shouldSkipNotification

				                ) {

				                  continue;

				                }

				                if (!hasConflictLabel) {

				                  await github.rest.issues.addLabels({

				                    owner: context.repo.owner,

				                    repo: context.repo.repo,

				                    issue_number: pr.number,

				                    labels: [conflictLabel],

				                  });

				                  console.log(`Added 'conflicts' label to PR #${pr.number}`);

				                }

				                const assignee = pr.assignee.login;

				                if (assignee && !shouldSkipNotification) {

				                  await github.rest.issues.createComment({

				                    owner: context.repo.owner,

				                    repo: context.repo.repo,

				                    issue_number: pr.number,

				                    body: `@${assignee}, this PR has merge conflicts with the base branch. Please resolve the conflicts so we can merge it.`,

				                  });

				                  console.log(`Notified @${assignee} for PR #${pr.number}`);

				                }

				              } else {

				                console.log(`PR #${pr.number} is mergeable, no action needed.`);

				              }

				            }

				            console.log(`Total PRs checked: ${prs.length}`);
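
The two time-based gates in the new script can be summarised as small predicates. This is a sketch in Python rather than the workflow's JavaScript, under stated assumptions: `should_inspect` and `should_comment` are hypothetical names, and the one-week window mirrors `oneWeekAgo` above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WEEK = timedelta(days=7)


def should_inspect(now: datetime, updated_at: datetime, is_push_event: bool) -> bool:
    """Scheduled runs skip PRs updated within the last week; push runs inspect all."""
    return is_push_event or updated_at < now - WEEK


def should_comment(now: datetime, last_notified: Optional[datetime]) -> bool:
    """Comment only if the bot has not notified within the last week."""
    return last_notified is None or last_notified < now - WEEK


now = datetime(2025, 12, 5, 10, 0, tzinfo=timezone.utc)
recent = now - timedelta(days=2)
stale = now - timedelta(days=10)
print(should_inspect(now, recent, is_push_event=False))
print(should_comment(now, recent), should_comment(now, stale))
```

Note that in the script the `conflicts` label is still added when a comment is suppressed, unless the PR already carries it.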

									
										34

.github/workflows/docs-validate-metrics.yml
									
										Normal file
												
				@@ -0,0 +1,34 @@

				name: Docs / Validate metrics

				on:

				  pull_request:

				    branches:

				      - master

				      - enterprise

				    paths:

				      - '**/*.cc'

				      - 'scripts/metrics-config.yml' 

				      - 'scripts/get_description.py'

				      - 'docs/_ext/scylladb_metrics.py'

				jobs:

				  validate-metrics:

				    runs-on: ubuntu-latest

				    name: Check metrics documentation coverage

				    steps:

				    - name: Checkout code

				      uses: actions/checkout@v4

				      with:

				        submodules: true

				    - name: Set up Python

				      uses: actions/setup-python@v6

				      with:

				        python-version: '3.10'

				    - name: Install dependencies

				      run: pip install PyYAML

				    - name: Validate metrics

				      run: python3 scripts/get_description.py --validate -c scripts/metrics-config.yml

									
										24

.github/workflows/iwyu.yaml
									
												
				@@ -11,7 +11,8 @@ env:

				  CLEANER_OUTPUT_PATH: build/clang-include-cleaner.log

				  # the "idl" subdirectory does not contain C++ source code. the .hh files in it are

				  # supposed to be processed by idl-compiler.py, so we don't check them using the cleaner

				  CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation

				  CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service

				  SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log

				permissions: {}

				@@ -80,7 +81,24 @@ jobs:

				          done

				      - run: |

				          echo "::remove-matcher owner=clang-include-cleaner::"

				      - run: |

				          echo "::add-matcher::.github/seastar-bad-include.json"

				      - name: check for seastar includes

				        run: |

				          git -c safe.directory="$PWD"    \

				            grep -nE '#include +"seastar/' \

				            | tee "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH"

				      - run: |

				          echo "::remove-matcher owner=seastar-bad-include::"

				      - uses: actions/upload-artifact@v4

				        with:

				          name: Logs (clang-include-cleaner)

				          path: "./${{ env.CLEANER_OUTPUT_PATH }}"

				          name: Logs

				          path: |

				            ${{ env.CLEANER_OUTPUT_PATH }}

				            ${{ env.SEASTAR_BAD_INCLUDE_OUTPUT_PATH }}

				      - name: fail if seastar headers are included as an internal library

				        run: |

				          if [ -s "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH" ]; then

				            echo "::error::Found #include \"seastar/ in the source code. Use angle brackets instead."

				            exit 1

				          fi

									
										29

.github/workflows/make-pr-ready-for-review.yaml
									
										Normal file
												
				@@ -0,0 +1,29 @@

				name: Mark PR as Ready When Conflicts Label is Removed

				on:

				  pull_request_target:

				    types:

				      - unlabeled

				env:

				  DEFAULT_BRANCH: 'master'

				jobs:

				  mark-ready:

				    if: github.event.label.name == 'conflicts'

				    runs-on: ubuntu-latest

				    permissions:

				      pull-requests: write

				    steps:

				      - name: Checkout repository

				        uses: actions/checkout@v4

				        with:

				          repository: ${{ github.repository }}

				          ref: ${{ env.DEFAULT_BRANCH }}

				          token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				          fetch-depth: 1

				      - name: Mark pull request as ready for review

				        run:  gh pr ready "${{ github.event.pull_request.number }}"

				        env:

				          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

									
										4

.github/workflows/pr-require-backport-label.yaml
									
												
				@@ -13,10 +13,12 @@ jobs:

				      issues: write

				      pull-requests: write

				    steps:

				      - name: Wait for label to be added

				        run: sleep 1m

				      - uses: mheap/github-action-required-labels@v5

				        with:

				          mode: minimum

				          count: 1

				          labels: "backport/none\nbackport/\\d.\\d"

				          labels: "backport/none\nbackport/\\d{4}\\.\\d+\nbackport/\\d+\\.\\d+"

				          use_regex: true

				          add_comment: false

									
										7

.github/workflows/seastar.yaml
									
												
				@@ -15,10 +15,13 @@ env:

				  BUILD_DIR: build

				jobs:

				  read-toolchain:

				    uses: ./.github/workflows/read-toolchain.yaml

				  build-with-the-latest-seastar:

				    needs:

				      - read-toolchain

				    runs-on: ubuntu-latest

				    # be consistent with tools/toolchain/image

				    container: scylladb/scylla-toolchain:fedora-40-20240621

				    container: ${{ needs.read-toolchain.outputs.image }}

				    strategy:

				      matrix:

				        build_type:

									
										4

.github/workflows/sync-labels.yaml
									
												
				@@ -37,13 +37,13 @@ jobs:

				        run: python .github/scripts/sync_labels.py --repo ${{ github.repository }} --number ${{ github.event.number }} --action ${{ github.event.action }}

				      - name: Pull request labeled or unlabeled event

				        if: github.event_name == 'pull_request_target' && startsWith(github.event.label.name, 'backport/')

				        if: github.event_name == 'pull_request_target' && (startsWith(github.event.label.name, 'backport/') || github.event.label.name == 'P0' || github.event.label.name == 'P1')

				        env:

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				        run: python .github/scripts/sync_labels.py --repo ${{ github.repository }} --number ${{ github.event.number }} --action ${{ github.event.action }} --label ${{ github.event.label.name }}

				      - name: Issue labeled or unlabeled event

				        if: github.event_name == 'issues' && startsWith(github.event.label.name, 'backport/')

				        if: github.event_name == 'issues' && (startsWith(github.event.label.name, 'backport/') || github.event.label.name == 'P0' || github.event.label.name == 'P1')

				        env:

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				        run: python .github/scripts/sync_labels.py --repo ${{ github.repository }} --number ${{ github.event.issue.number }} --action ${{ github.event.action }} --is_issue --label ${{ github.event.label.name }}

									
										21

.github/workflows/trigger-scylla-ci.yaml
									
										Normal file
												
				@@ -0,0 +1,21 @@

				name: Trigger Scylla CI Route

				on:

				  issue_comment:

				    types: [created]

				jobs:

				  trigger-jenkins:

				    if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')

				    runs-on: ubuntu-latest

				    steps:

				      - name: Trigger Scylla-CI-Route Jenkins Job

				        env:

				          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}

				          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}

				          JENKINS_URL: "https://jenkins.scylladb.com"

				        run: |

				          PR_NUMBER=${{ github.event.issue.number }}

				          PR_REPO_NAME=${{ github.event.repository.full_name }}

				          curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \

				          --user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v

									
										242

.github/workflows/trigger_ci.yaml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,242 @@

				name: Trigger next gating

				on:

				  pull_request_target:

				    types: [opened, reopened, synchronize]

				  issue_comment:

				    types: [created]

				jobs:

				  trigger-ci:

				    runs-on: ubuntu-latest

				    steps:

				      - name: Dump GitHub context

				        env:

				          GITHUB_CONTEXT: ${{ toJson(github) }}

				        run: echo "$GITHUB_CONTEXT"

				      - name: Checkout PR code

				        uses: actions/checkout@v3

				        with:

				          fetch-depth: 0  # Needed to access full history

				          ref: ${{ github.event.pull_request.head.ref }}

				      - name: Fetch before commit if needed

				        run: |

				          if ! git cat-file -e ${{ github.event.before }} 2>/dev/null; then

				            echo "Fetching before commit ${{ github.event.before }}"

				            git fetch --depth=1 origin ${{ github.event.before }}

				          fi

				      - name: Compare commits for file changes

				        if: github.event.action == 'synchronize'

				        env:

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				        run: |

				          echo "Base: ${{ github.event.before }}"

				          echo "Head: ${{ github.event.after }}"

				          TREE_BEFORE=$(git show -s --format=%T ${{ github.event.before }})

				          TREE_AFTER=$(git show -s --format=%T ${{ github.event.after }})

				          echo "TREE_BEFORE=$TREE_BEFORE" >> $GITHUB_ENV

				          echo "TREE_AFTER=$TREE_AFTER" >> $GITHUB_ENV

				      - name: Check if last push has file changes

				        run: |

				          if [[ "${{ env.TREE_BEFORE }}" == "${{ env.TREE_AFTER }}" ]]; then

				            echo "No file changes detected in the last push, only commit message edit."

				            echo "has_file_changes=false" >> $GITHUB_ENV

				          else

				            echo "File changes detected in the last push."

				            echo "has_file_changes=true" >> $GITHUB_ENV

				          fi

				      - name: Rule 1 - Check PR draft or conflict status

				        run: |

				          # Check if PR is in draft mode

				          IS_DRAFT="${{ github.event.pull_request.draft }}"

				          # Check if PR has 'conflict' label

				          HAS_CONFLICT_LABEL="false"

				          LABELS='${{ toJson(github.event.pull_request.labels) }}'

				          if echo "$LABELS" | jq -r '.[].name' | grep -q "^conflict$"; then

				            HAS_CONFLICT_LABEL="true"

				          fi

				          # Set draft_or_conflict variable

				          if [[ "$IS_DRAFT" == "true" || "$HAS_CONFLICT_LABEL" == "true" ]]; then

				            echo "draft_or_conflict=true" >> $GITHUB_ENV

				            echo "✅ Rule 1: PR is in draft mode or has conflict label - setting draft_or_conflict=true"

				          else

				            echo "draft_or_conflict=false" >> $GITHUB_ENV

				            echo "✅ Rule 1: PR is ready and has no conflict label - setting draft_or_conflict=false"

				          fi

				          echo "Draft status: $IS_DRAFT"

				          echo "Has conflict label: $HAS_CONFLICT_LABEL"

				          echo "Result: draft_or_conflict = $draft_or_conflict"

				      - name: Rule 2 - Check labels

				        run: |

				          # Check if PR has P0 or P1 labels

				          HAS_P0_P1_LABEL="false"

				          LABELS='${{ toJson(github.event.pull_request.labels) }}'

				          if echo "$LABELS" | jq -r '.[].name' | grep -E "^(P0|P1)$" > /dev/null; then

				            HAS_P0_P1_LABEL="true"

				          fi

				          # Check if PR already has force_on_cloud label

				          HAS_FORCE_ON_CLOUD_LABEL="false"

				          echo "HAS_FORCE_ON_CLOUD_LABEL=false" >> $GITHUB_ENV

				          if echo "$LABELS" | jq -r '.[].name' | grep -q "^force_on_cloud$"; then

				            HAS_FORCE_ON_CLOUD_LABEL="true"

				            echo "HAS_FORCE_ON_CLOUD_LABEL=true" >> $GITHUB_ENV

				          fi

				          echo "Has P0/P1 label: $HAS_P0_P1_LABEL"

				          echo "Has force_on_cloud label: $HAS_FORCE_ON_CLOUD_LABEL"

				          # Add force_on_cloud label if PR has P0/P1 and doesn't already have force_on_cloud

				          if [[ "$HAS_P0_P1_LABEL" == "true" && "$HAS_FORCE_ON_CLOUD_LABEL" == "false" ]]; then

				            echo "✅ Rule 2: PR has P0 or P1 label - adding force_on_cloud label"

				            curl -X POST \

				              -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \

				              -H "Accept: application/vnd.github.v3+json" \

				              "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/labels" \

				              -d '{"labels":["force_on_cloud"]}'

				          elif [[ "$HAS_P0_P1_LABEL" == "true" && "$HAS_FORCE_ON_CLOUD_LABEL" == "true" ]]; then

				            echo "✅ Rule 2: PR has P0 or P1 label and already has force_on_cloud label - no action needed"

				          else

				            echo "✅ Rule 2: PR does not have P0 or P1 label - no force_on_cloud label needed"

				          fi

				          SKIP_UNIT_TEST_CUSTOM="false"

				          if echo "$LABELS" | jq -r '.[].name' | grep -q "^ci/skip_unit-tests_custom$"; then

				            SKIP_UNIT_TEST_CUSTOM="true"

				          fi

				          echo "SKIP_UNIT_TEST_CUSTOM=$SKIP_UNIT_TEST_CUSTOM" >> $GITHUB_ENV

				      - name: Rule 3 - Analyze changed files and set build requirements

				        run: |

				          # Get list of changed files

				          CHANGED_FILES=$(git diff --name-only ${{ github.event.pull_request.base.sha }} ${{ github.event.pull_request.head.sha }})

				          echo "Changed files:"

				          echo "$CHANGED_FILES"

				          echo ""

				          # Initialize all requirements to false

				          REQUIRE_BUILD="false"

				          REQUIRE_DTEST="false"

				          REQUIRE_UNITTEST="false"

				          REQUIRE_ARTIFACTS="false"

				          REQUIRE_SCYLLA_GDB="false"

				          # Check each file against patterns

				          while IFS= read -r file; do

				            if [[ -n "$file" ]]; then

				              echo "Checking file: $file"

				              # Build pattern: ^(?!scripts\/pull_github_pr.sh).*$

				              # Everything except scripts/pull_github_pr.sh

				              if [[ "$file" != "scripts/pull_github_pr.sh" ]]; then

				                REQUIRE_BUILD="true"

				                echo "  ✓ Matches build pattern"

				              fi

				              # Dtest pattern: ^(?!test(.py|\/)|dist\/docker\/|dist\/common\/scripts\/).*$

				              # Everything except test files, dist/docker/, dist/common/scripts/

				              if [[ ! "$file" =~ ^test(\.py|/).*$ ]] && [[ ! "$file" =~ ^dist/docker/.*$ ]] && [[ ! "$file" =~ ^dist/common/scripts/.*$ ]]; then

				                REQUIRE_DTEST="true"

				                echo "  ✓ Matches dtest pattern"

				              fi

				              # Unittest pattern: ^(?!dist\/docker\/|dist\/common\/scripts).*$

				              # Everything except dist/docker/, dist/common/scripts/

				              if [[ ! "$file" =~ ^dist/docker/.*$ ]] && [[ ! "$file" =~ ^dist/common/scripts.*$ ]]; then

				                REQUIRE_UNITTEST="true"

				                echo "  ✓ Matches unittest pattern"

				              fi

				              # Artifacts pattern: ^(?:dist|tools\/toolchain).*$

				              # Files starting with dist or tools/toolchain

				              if [[ "$file" =~ ^dist.*$ ]] || [[ "$file" =~ ^tools/toolchain.*$ ]]; then

				                REQUIRE_ARTIFACTS="true"

				                echo "  ✓ Matches artifacts pattern"

				              fi

				              # Scylla GDB pattern: ^(scylla-gdb.py).*$

				              # Files starting with scylla-gdb.py

				              if [[ "$file" =~ ^scylla-gdb\.py.*$ ]]; then

				                REQUIRE_SCYLLA_GDB="true"

				                echo "  ✓ Matches scylla_gdb pattern"

				              fi

				            fi

				          done <<< "$CHANGED_FILES"

				          # Set environment variables

				          echo "requireBuild=$REQUIRE_BUILD" >> $GITHUB_ENV

				          echo "requireDtest=$REQUIRE_DTEST" >> $GITHUB_ENV

				          echo "requireUnittest=$REQUIRE_UNITTEST" >> $GITHUB_ENV

				          echo "requireArtifacts=$REQUIRE_ARTIFACTS" >> $GITHUB_ENV

				          echo "requireScyllaGdb=$REQUIRE_SCYLLA_GDB" >> $GITHUB_ENV

				          echo ""

				          echo "✅ Rule 3: File analysis complete"

				          echo "Build required: $REQUIRE_BUILD"

				          echo "Dtest required: $REQUIRE_DTEST"

				          echo "Unittest required: $REQUIRE_UNITTEST"

				          echo "Artifacts required: $REQUIRE_ARTIFACTS"

				          echo "Scylla GDB required: $REQUIRE_SCYLLA_GDB"

				      - name: Determine Jenkins Job Name

				        run: |

				          if [[ "${{ github.ref_name }}" == "next" ]]; then

				            FOLDER_NAME="scylla-master"

				          elif [[ "${{ github.ref_name }}" == "next-enterprise" ]]; then

				            FOLDER_NAME="scylla-enterprise"

				          else

				            VERSION=$(echo "${{ github.ref_name }}" | awk -F'-' '{print $2}')

				            if [[ "$VERSION" =~ ^202[0-4]\.[0-9]+$ ]]; then

				              FOLDER_NAME="enterprise-$VERSION"

				            elif [[ "$VERSION" =~ ^[0-9]+\.[0-9]+$ ]]; then

				              FOLDER_NAME="scylla-$VERSION"

				            fi

				          fi

				          echo "JOB_NAME=${FOLDER_NAME}/job/scylla-ci" >> $GITHUB_ENV

				      - name: Trigger Jenkins Job

				        if: env.draft_or_conflict == 'false' && env.has_file_changes == 'true' && github.event.action == 'opened' || github.event.action == 'reopened'

				        env:

				          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}

				          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}

				          JENKINS_URL: "https://jenkins.scylladb.com"

				          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

				        run: |

				          PR_NUMBER=${{ github.event.pull_request.number || github.event.issue.number }}

				          PR_REPO_NAME=${{ github.event.repository.full_name }}

				          echo "Triggering Jenkins Job: $JOB_NAME"

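				          # Build the query string piece by piece: a quoted URL split with "\" keeps the next line's indentation as literal spaces.

				          # The require* values use the names exported to GITHUB_ENV by the Rule 3 step.

				          PARAMS="PR_NUMBER=$PR_NUMBER"

				          PARAMS+="&RUN_DTEST=$requireDtest"

				          PARAMS+="&RUN_ONLY_SCYLLA_GDB=$requireScyllaGdb"

				          PARAMS+="&RUN_UNIT_TEST=$requireUnittest"

				          PARAMS+="&FORCE_ON_CLOUD=$HAS_FORCE_ON_CLOUD_LABEL"

				          PARAMS+="&SKIP_UNIT_TEST_CUSTOM=$SKIP_UNIT_TEST_CUSTOM"

				          PARAMS+="&RUN_ARTIFACT_TESTS=$requireArtifacts"

				          curl -X POST \

				            "$JENKINS_URL/job/$JOB_NAME/buildWithParameters?$PARAMS" \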

				            --fail \

				            --user "$JENKINS_USER:$JENKINS_API_TOKEN" \

				            -i -v

				  trigger-ci-via-comment:

				    if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')

				    runs-on: ubuntu-latest

				    steps:

				      - name: Trigger Scylla-CI Jenkins Job

				        env:

				          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}

				          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}

				          JENKINS_URL: "https://jenkins.scylladb.com"

				        run: |

				          PR_NUMBER=${{ github.event.issue.number }}

				          PR_REPO_NAME=${{ github.event.repository.full_name }}

				          curl -X POST "$JENKINS_URL/job/$JOB_NAME/buildWithParameters?PR_NUMBER=$PR_NUMBER" \

				          --user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v
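
The file classification in the "Rule 3" step above can be checked outside the workflow. The following is a minimal sketch in Python, assuming the intended behaviour is the one stated by the PCRE-style patterns in the step's comments; the helper name `classify` and the dictionary keys are hypothetical and do not exist in the repository.

```python
import re

# Patterns transcribed from the comments in the "Rule 3" step. They describe
# the intended matching, which may differ from what the bash tests accept.
PATTERNS = {
    "build": re.compile(r"^(?!scripts/pull_github_pr.sh).*$"),
    "dtest": re.compile(r"^(?!test(.py|/)|dist/docker/|dist/common/scripts/).*$"),
    "unittest": re.compile(r"^(?!dist/docker/|dist/common/scripts).*$"),
    "artifacts": re.compile(r"^(?:dist|tools/toolchain).*$"),
    "scylla_gdb": re.compile(r"^(scylla-gdb.py).*$"),
}


def classify(changed_files):
    """Return {requirement: bool} for a list of changed paths (hypothetical helper)."""
    result = {name: False for name in PATTERNS}
    for path in changed_files:
        if not path:
            continue  # the workflow also skips empty lines
        for name, pattern in PATTERNS.items():
            if pattern.match(path):
                result[name] = True
    return result
```

Feeding it a few representative paths (a file under `test/`, one under `dist/docker/`, and `scripts/pull_github_pr.sh`) is a quick way to compare the documented patterns with what the bash conditions actually do.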

									
										50

.github/workflows/trigger_jenkins.yaml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,50 @@

				name: Trigger next gating

				on:

				  push:

				    branches:

				      - next**

				jobs:

				  trigger-jenkins:

				    runs-on: ubuntu-latest

				    steps:

				      - name: Determine Jenkins Job Name

				        run: |

				          if [[ "${{ github.ref_name }}" == "next" ]]; then

				            FOLDER_NAME="scylla-master"

				          elif [[ "${{ github.ref_name }}" == "next-enterprise" ]]; then

				            FOLDER_NAME="scylla-enterprise"

				          else

				            VERSION=$(echo "${{ github.ref_name }}" | awk -F'-' '{print $2}')

				            if [[ "$VERSION" =~ ^202[0-4]\.[0-9]+$ ]]; then

				              FOLDER_NAME="enterprise-$VERSION"

				            elif [[ "$VERSION" =~ ^[0-9]+\.[0-9]+$ ]]; then

				              FOLDER_NAME="scylla-$VERSION"

				            fi

				          fi

				          echo "JOB_NAME=${FOLDER_NAME}/job/next" >> $GITHUB_ENV

				      - name: Trigger Jenkins Job

				        env:

				          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}

				          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}

				          JENKINS_URL: "https://jenkins.scylladb.com"

				          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

				        run: |

				          echo "Triggering Jenkins Job: $JOB_NAME"

				          if ! curl -X POST "$JENKINS_URL/job/$JOB_NAME/buildWithParameters" --fail --user "$JENKINS_USER:$JENKINS_API_TOKEN" -i -v; then

				            echo "Error: Jenkins job trigger failed"

				            # Send Slack message

				            curl -X POST -H 'Content-type: application/json' \

				              -H "Authorization: Bearer $SLACK_BOT_TOKEN" \

				              --data '{

				                "channel": "#releng-team",

				                "text": "🚨 @here '$JOB_NAME' failed to be triggered, please check https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }} for more details",

				                "icon_emoji": ":warning:"

				              }' \

				              https://slack.com/api/chat.postMessage

				            exit 1

				          fi
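
The branch-to-folder mapping in "Determine Jenkins Job Name" is easy to get subtly wrong, so here is a minimal sketch of the same rules in Python. The function name `jenkins_folder` is hypothetical; the rules are copied from the bash step, including its regular expressions.

```python
import re


def jenkins_folder(ref_name):
    """Map a branch name to a Jenkins folder, mirroring the bash branches.

    Returns None when no rule matches (the bash step leaves FOLDER_NAME unset).
    """
    if ref_name == "next":
        return "scylla-master"
    if ref_name == "next-enterprise":
        return "scylla-enterprise"
    # Equivalent of: awk -F'-' '{print $2}' (second dash-separated field)
    parts = ref_name.split("-")
    version = parts[1] if len(parts) > 1 else ""
    if re.fullmatch(r"202[0-4]\.[0-9]+", version):
        return f"enterprise-{version}"
    if re.fullmatch(r"[0-9]+\.[0-9]+", version):
        return f"scylla-{version}"
    return None
```

With these rules a branch such as `next-2025.1` falls through the `202[0-4]` test and lands in `scylla-2025.1`, and a branch with no second field yields no folder at all, in which case the workflow would build a job path starting with `/job/`.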

									
										58

.github/workflows/urgent_issue_reminder.yml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,58 @@

				name: Urgent Issue Reminder

				on:

				  schedule:

				    - cron: '10 8 * * *' # Runs daily at 08:10 UTC

				jobs:

				  reminder:

				    runs-on: ubuntu-latest

				    steps:

				    - name: Send reminders

				      uses: actions/github-script@v7

				      with:

				        script: |

				          const labelFilters = ['P0', 'P1', 'Field-Tier1','status/release blocker', 'status/regression']; 

				          const excludingLabelFilters = ['documentation'];

				          const daysInactive = 7;

				          const now = new Date();

				          // Fetch open issues

				          const issues = await github.rest.issues.listForRepo({

				            owner: context.repo.owner,

				            repo: context.repo.repo,

				            state: 'open'

				          });

				          console.log("Looking for issues with labels:"+labelFilters+", excluding labels:"+excludingLabelFilters+ ", inactive for more than "+daysInactive+" days.");

				          for (const issue of issues.data) {

				            // Check if issue has any of the specified labels

				            const hasFilteredLabel = issue.labels.some(label => labelFilters.includes(label.name));

				            const hasExcludingLabel = issue.labels.some(label => excludingLabelFilters.includes(label.name));

				            if (hasExcludingLabel) continue;

				            if (!hasFilteredLabel) continue;

				            // Check for inactivity

				            const lastUpdated = new Date(issue.updated_at);

				            const diffInDays = (now - lastUpdated) / (1000 * 60 * 60 * 24);

				            console.log("Issue #"+issue.number+"; Days inactive:"+diffInDays);

				            if (diffInDays > daysInactive) {

				              if (issue.assignees.length > 0) {

				                console.log("==>> Alert about issue #"+issue.number);

				                const assigneesLogins = issue.assignees.map(assignee => `@${assignee.login}`).join(', ');

				                await github.rest.issues.createComment({

				                  owner: context.repo.owner,

				                  repo: context.repo.repo,

				                  issue_number: issue.number,

				                  body: `${assigneesLogins}, This urgent issue had no activity for more than ${daysInactive} days. Please check its status.\n CC @mykaul @dani-tweig`

				                });

				              } else {

				                await github.rest.issues.createComment({

				                  owner: context.repo.owner,

				                  repo: context.repo.repo,

				                  issue_number: issue.number,

				                  body: `This urgent issue had no activity for more than ${daysInactive} days. Please check its status.\n CC @mykaul @dani-tweig`

				                });

				              }

				            }

				          }
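
The selection logic of the reminder script above reduces to a small predicate. A minimal sketch in Python, assuming label names and timestamps are already extracted from the issue; `needs_reminder` and the constant names are hypothetical, while the label lists and the seven-day threshold are taken from the script.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the workflow script.
LABEL_FILTERS = {"P0", "P1", "Field-Tier1", "status/release blocker", "status/regression"}
EXCLUDING_LABEL_FILTERS = {"documentation"}
DAYS_INACTIVE = 7


def needs_reminder(labels, updated_at, now):
    """Return True if an issue with these labels should get a reminder comment."""
    label_set = set(labels)
    if label_set & EXCLUDING_LABEL_FILTERS:
        return False  # excluded labels win over urgent ones
    if not (label_set & LABEL_FILTERS):
        return False  # not an urgent issue
    # Strictly more than DAYS_INACTIVE days since the last update
    return (now - updated_at) > timedelta(days=DAYS_INACTIVE)
```

Keeping the predicate separate from the API calls makes the boundary case (exactly seven days of inactivity does not trigger a comment) straightforward to verify.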

2

.gitignore vendored


@@ -35,3 +35,5 @@ compile_commands.json
 .envrc
 clang_build
 .idea/
 nuke
 rust/target

3

.gitmodules vendored


@@ -9,9 +9,6 @@
 [submodule "abseil"]
 	path = abseil
 	url = ../abseil-cpp
 [submodule "scylla-tools"]
 	path = tools/java
 	url = ../scylla-tools-java
 [submodule "scylla-python3"]
 	path = tools/python3
 	url = ../scylla-python3

									
										141

CMakeLists.txt
									
				@@ -22,6 +22,8 @@ if(DEFINED CMAKE_BUILD_TYPE)

				    endif()

				endif(DEFINED CMAKE_BUILD_TYPE)

				option(Scylla_ENABLE_LTO "Turn on link-time optimization for the 'release' mode." ON)

				include(mode.common)

				get_property(is_multi_config GLOBAL PROPERTY GENERATOR_IS_MULTI_CONFIG)

				if(is_multi_config)

				@@ -42,11 +44,12 @@ else()

				endif()

				include(limit_jobs)

				# Configure Seastar compile options to align with Scylla

				set(CMAKE_CXX_STANDARD "23" CACHE INTERNAL "")

				set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")

				set(CMAKE_CXX_SCAN_FOR_MODULES OFF CACHE INTERNAL "")

				set(CMAKE_CXX_VISIBILITY_PRESET hidden)

				set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)

				if(is_multi_config)

				    find_package(Seastar)

				@@ -63,36 +66,37 @@ if(is_multi_config)

				    #   establishing proper dependencies between them

				    include(ExternalProject)

				    # should be consistent with configure_seastar() in configure.py

				    set(seastar_build_dir "${CMAKE_BINARY_DIR}/$<CONFIG>/seastar")

				    ExternalProject_Add(Seastar

				        SOURCE_DIR "${PROJECT_SOURCE_DIR}/seastar"

				        BINARY_DIR "${CMAKE_BINARY_DIR}/$<CONFIG>/seastar"

				        CONFIGURE_COMMAND ""

				        BUILD_COMMAND ${CMAKE_COMMAND} --build <BINARY_DIR>

				        BUILD_COMMAND ${CMAKE_COMMAND} --build "${seastar_build_dir}"

				          --target seastar

				          --target seastar_testing

				          --target seastar_perf_testing

				          --target app_iotune

				        BUILD_ALWAYS ON

				        BUILD_BYPRODUCTS

				          <BINARY_DIR>/libseastar.$<IF:$<CONFIG:Debug,Dev>,so,a>

				          <BINARY_DIR>/libseastar_testing.$<IF:$<CONFIG:Debug,Dev>,so,a>

				          <BINARY_DIR>/libseastar_perf_testing.$<IF:$<CONFIG:Debug,Dev>,so,a>

				          <BINARY_DIR>/apps/iotune/iotune

				          <BINARY_DIR>/gen/include/seastar/http/chunk_parsers.hh

				          <BINARY_DIR>/gen/include/seastar/http/request_parser.hh

				          <BINARY_DIR>/gen/include/seastar/http/response_parser.hh

				          ${seastar_build_dir}/libseastar.$<IF:$<CONFIG:Debug,Dev>,so,a>

				          ${seastar_build_dir}/libseastar_testing.$<IF:$<CONFIG:Debug,Dev>,so,a>

				          ${seastar_build_dir}/libseastar_perf_testing.$<IF:$<CONFIG:Debug,Dev>,so,a>

				          ${seastar_build_dir}/apps/iotune/iotune

				          ${seastar_build_dir}/gen/include/seastar/http/chunk_parsers.hh

				          ${seastar_build_dir}/gen/include/seastar/http/request_parser.hh

				          ${seastar_build_dir}/gen/include/seastar/http/response_parser.hh

				        INSTALL_COMMAND "")

				    add_dependencies(Seastar::seastar Seastar)

				    add_dependencies(Seastar::seastar_testing Seastar)

				else()

				    set(Seastar_TESTING ON CACHE BOOL "" FORCE)

				    set(Seastar_API_LEVEL 7 CACHE STRING "" FORCE)

				    set(Seastar_API_LEVEL 9 CACHE STRING "" FORCE)

				    set(Seastar_DEPRECATED_OSTREAM_FORMATTERS OFF CACHE BOOL "" FORCE)

				    set(Seastar_APPS ON CACHE BOOL "" FORCE)

				    set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)

				    set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)

				    set(Seastar_IO_URING ON CACHE BOOL "" FORCE)

				    set(Seastar_SCHEDULING_GROUPS_COUNT 16 CACHE STRING "" FORCE)

				    set(Seastar_SCHEDULING_GROUPS_COUNT 21 CACHE STRING "" FORCE)

				    set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)

				    add_subdirectory(seastar)

				    target_compile_definitions (seastar

				@@ -102,13 +106,18 @@ endif()

				set(ABSL_PROPAGATE_CXX_STD ON CACHE BOOL "" FORCE)

				if(Scylla_ENABLE_LTO)

				    list(APPEND absl_cxx_flags $<$<CONFIG:RelWithDebInfo>:${CMAKE_CXX_COMPILE_OPTIONS_IPO};-ffat-lto-objects>)

				endif()

				find_package(Sanitizers QUIET)

				set(sanitizer_cxx_flags

				list(APPEND absl_cxx_flags

				    $<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)

				if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")

				    set(ABSL_GCC_FLAGS ${sanitizer_cxx_flags})

				    list(APPEND ABSL_GCC_FLAGS ${absl_cxx_flags})

				elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")

				    set(ABSL_LLVM_FLAGS ${sanitizer_cxx_flags})

				    list(APPEND absl_cxx_flags "-Wno-deprecated-builtins")

				    list(APPEND ABSL_LLVM_FLAGS ${absl_cxx_flags})

				endif()

				set(ABSL_DEFAULT_LINKOPTS

				    $<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_LINK_LIBRARIES>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_LINK_LIBRARIES>>)

				@@ -141,11 +150,13 @@ find_package(ICU COMPONENTS uc i18n REQUIRED)

				find_package(fmt 10.0.0 REQUIRED)

				find_package(libdeflate REQUIRED)

				find_package(libxcrypt REQUIRED)

				find_package(p11-kit REQUIRED)

				find_package(Snappy REQUIRED)

				find_package(RapidJSON REQUIRED)

				find_package(xxHash REQUIRED)

				find_package(yaml-cpp REQUIRED)

				find_package(zstd REQUIRED)

				find_package(lz4 REQUIRED)

				set(scylla_gen_build_dir "${CMAKE_BINARY_DIR}/gen")

				file(MAKE_DIRECTORY "${scylla_gen_build_dir}")

				@@ -153,47 +164,68 @@ file(MAKE_DIRECTORY "${scylla_gen_build_dir}")

				include(add_version_library)

				generate_scylla_version()

				add_library(scylla-zstd STATIC

				    zstd.cc)

				target_link_libraries(scylla-zstd

				  PRIVATE

				    db

				option(Scylla_USE_PRECOMPILED_HEADER "Use precompiled header for Scylla" ON)

				add_library(scylla-precompiled-header STATIC exported_templates.cc)

				target_link_libraries(scylla-precompiled-header PRIVATE

				    absl::headers

				    absl::btree

				    absl::hash

				    absl::raw_hash_set

				    Seastar::seastar

				    zstd::libzstd)

				    Snappy::snappy

				    systemd

				    ZLIB::ZLIB

				    lz4::lz4_static

				    zstd::zstd_static)

				if (Scylla_USE_PRECOMPILED_HEADER)

				  set(Scylla_USE_PRECOMPILED_HEADER_USE ON)

				  find_program(DISTCC_EXEC NAMES distcc OPTIONAL)

				  if (DISTCC_EXEC)

				    if(DEFINED ENV{DISTCC_HOSTS})

				      set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)

				      message(STATUS "Disabling precompiled header usage because distcc exists and DISTCC_HOSTS is set, assuming you're using distributed compilation.")

				    else()

				      file(REAL_PATH "~/.distcc/hosts" DIST_CC_HOSTS_PATH EXPAND_TILDE)

				      if (EXISTS ${DIST_CC_HOSTS_PATH})

				        set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)

				        message(STATUS "Disabling precompiled header usage because distcc and ~/.distcc/hosts exists, assuming you're using distributed compilation.")

				      endif()

				    endif()

				  endif()

				  if (Scylla_USE_PRECOMPILED_HEADER_USE)

				    message(STATUS "Using precompiled header for Scylla - remember to add `sloppiness = pch_defines,time_macros` to ccache.conf, if you're using ccache.")

				    target_precompile_headers(scylla-precompiled-header PRIVATE "stdafx.hh")

				    target_compile_definitions(scylla-precompiled-header PRIVATE SCYLLA_USE_PRECOMPILED_HEADER)

				  endif()

				else()

				  set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)

				endif()

				add_library(scylla-main STATIC)

				target_sources(scylla-main

				  PRIVATE

				    absl-flat_hash_map.cc

				    bytes.cc

				    client_data.cc

				    clocks-impl.cc

				    collection_mutation.cc

				    compress.cc

				    converting_mutation_partition_applier.cc

				    counters.cc

				    direct_failure_detector/failure_detector.cc

				    duration.cc

				    sstable_dict_autotrainer.cc

				    exceptions/exceptions.cc

				    frozen_schema.cc

				    generic_server.cc

				    debug.cc

				    init.cc

				    keys.cc

				    multishard_mutation_query.cc

				    keys/keys.cc

				    mutation_query.cc

				    node_ops/task_manager_module.cc

				    partition_slice_builder.cc

				    querier.cc

				    query.cc

				    query/query.cc

				    query_ranges_to_vnodes.cc

				    query-result-set.cc

				    query/query-result-set.cc

				    tombstone_gc_options.cc

				    tombstone_gc.cc

				    reader_concurrency_semaphore.cc

				    row_cache.cc

				    schema_mutations.cc

				    reader_concurrency_semaphore_group.cc

				    serializer.cc

				    service/direct_failure_detector/failure_detector.cc

				    sstables_loader.cc

				    table_helper.cc

				    tasks/task_handler.cc

				@@ -204,7 +236,6 @@ target_sources(scylla-main

				    vint-serialization.cc)

				target_link_libraries(scylla-main

				  PRIVATE

				    "$<LINK_LIBRARY:WHOLE_ARCHIVE,scylla-zstd>"

				    db

				    absl::headers

				    absl::btree

				@@ -213,7 +244,11 @@ target_link_libraries(scylla-main

				    Seastar::seastar

				    Snappy::snappy

				    systemd

				    ZLIB::ZLIB)

				    ZLIB::ZLIB

				    lz4::lz4_static

				    zstd::zstd_static

				    scylla-precompiled-header

				)

				option(Scylla_CHECK_HEADERS

				  "Add check-headers target for checking the self-containness of headers")

				@@ -248,6 +283,7 @@ add_custom_target(compiler-training)

				add_subdirectory(api)

				add_subdirectory(alternator)

				add_subdirectory(audit)

				add_subdirectory(db)

				add_subdirectory(auth)

				add_subdirectory(cdc)

				@@ -255,6 +291,7 @@ add_subdirectory(compaction)

				add_subdirectory(cql3)

				add_subdirectory(data_dictionary)

				add_subdirectory(dht)

				add_subdirectory(ent)

				add_subdirectory(gms)

				add_subdirectory(idl)

				add_subdirectory(index)

				@@ -265,7 +302,6 @@ add_subdirectory(mutation)

				add_subdirectory(mutation_writer)

				add_subdirectory(node_ops)

				add_subdirectory(readers)

				add_subdirectory(redis)

				add_subdirectory(replica)

				add_subdirectory(raft)

				add_subdirectory(repair)

				@@ -280,12 +316,14 @@ add_subdirectory(tracing)

				add_subdirectory(transport)

				add_subdirectory(types)

				add_subdirectory(utils)

				add_subdirectory(vector_search)

				add_version_library(scylla_version

				    release.cc)

				add_executable(scylla

				  main.cc)

				target_link_libraries(scylla PRIVATE

				set(scylla_libs

				    audit

				    scylla-main

				    api

				    auth

				@@ -296,17 +334,18 @@ target_link_libraries(scylla PRIVATE

				    cql3

				    data_dictionary

				    dht

				    encryption

				    gms

				    idl

				    index

				    lang

				    ldap

				    locator

				    message

				    mutation

				    mutation_writer

				    raft

				    readers

				    redis

				    repair

				    replica

				    schema

				@@ -319,9 +358,20 @@ target_link_libraries(scylla PRIVATE

				    tracing

				    transport

				    types

				    utils)

				    utils

				    vector_search)

				target_link_libraries(scylla PRIVATE

				    ${scylla_libs})

				if(Scylla_ENABLE_LTO)

				  include(enable_lto)

				  foreach(target scylla ${scylla_libs})

				    enable_lto(${target})

				  endforeach()

				endif()

				target_link_libraries(scylla PRIVATE

				    p11-kit::p11-kit

				    Seastar::seastar

				    absl::headers

				    yaml-cpp::yaml-cpp

				@@ -339,3 +389,10 @@ add_dependencies(compiler-training

				if(Scylla_DIST)

				  add_subdirectory(dist)

				endif()

				if(Scylla_BUILD_INSTRUMENTED)

				  add_subdirectory(pgo)

				endif()

				add_executable(patchelf

				  tools/patchelf.cc)

									
										2

CONTRIBUTING.md
									
				@@ -12,7 +12,7 @@ Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to re

				## Contributing code to Scylla

				Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).

				Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).

				If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).

				The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.

									
										69

HACKING.md
									
				@@ -43,7 +43,7 @@ $ ./tools/toolchain/dbuild ninja build/release/scylla

				$ ./tools/toolchain/dbuild ./build/release/scylla --developer-mode 1

				```

				Note: do not mix environemtns - either perform all your work with dbuild, or natively on the host.

				Note: do not mix environments - either perform all your work with dbuild, or natively on the host.

				Note2: you can get to an interactive shell within dbuild by running it without any parameters:

				```bash

				$ ./tools/toolchain/dbuild

				@@ -91,7 +91,7 @@ You can also specify a single mode. For example

				$ ninja-build release

				```

				Will build everytihng in release mode. The valid modes are

				Will build everything in release mode. The valid modes are

				* Debug: Enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer)

				  and other sanity checks. It has no optimizations, which allows for debugging with tools like

				@@ -220,28 +220,9 @@ On a development machine, one might run Scylla as

				$ SCYLLA_HOME=$HOME/scylla build/release/scylla --overprovisioned --developer-mode=yes

				```

				To interact with scylla it is recommended to build our versions of

				cqlsh and nodetool. They are available at

				https://github.com/scylladb/scylla-tools-java and can be built with

				```bash

				$ sudo ./install-dependencies.sh

				$ ant jar

				```

				cqlsh should work out of the box, but nodetool depends on a running

				scylla-jmx (https://github.com/scylladb/scylla-jmx). It can be build

				with

				```bash

				$ mvn package

				```

				and must be started with

				```bash

				$ ./scripts/scylla-jmx

				```

				To interact with scylla it is recommended to build our version of

				cqlsh. It is available at

				https://github.com/scylladb/scylla-cqlsh and is available as a submodule.

				### Branches and tags

				@@ -280,21 +261,45 @@ Once the patch set is ready to be reviewed, push the branch to the public remote

				### Development environment and source code navigation

				Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt`, for use only with development environments (not for building) so that they can properly analyze the source code.

				Scylla includes a [CMake](https://cmake.org/) file, `CMakeLists.txt` that can be used with development environments so

				that they can properly analyze the source code. However, building with CMake is not yet officially supported.

				[CLion](https://www.jetbrains.com/clion/) is a commercial IDE offers reasonably good source code navigation and advice for code hygiene, though its C++ parser sometimes makes errors and flags false issues.

				Good IDEs that have support for CMake build toolchain are [CLion](https://www.jetbrains.com/clion/),

				[KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).

				Other good options that directly parse CMake files are [KDevelop](https://www.kdevelop.org/) and [QtCreator](https://wiki.qt.io/Qt_Creator).

				[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects and its

				C++ parser has many issues.

				To use the `CMakeLists.txt` file with these programs, define the `FOR_IDE` CMake variable or shell environmental variable.

				#### CLion

				[Eclipse](https://eclipse.org/cdt/) is another open-source option. It doesn't natively work with CMake projects, and its C++ parser has many similar issues as CLion.

				[CLion](https://www.jetbrains.com/clion/) is a commercial IDE offers reasonably good source code navigation and advice

				for code hygiene, though its C++ parser sometimes makes errors and flags false issues. In order to enable proper code

				analysis in CLion, the following steps are needed:

				1. Get the ScyllaDB source code by following the [Getting the source code](#getting-the-source-code).

				2. Follow the steps in [Dependencies](#dependencies) in order to install the required tools natively into your system.

				   **Don't** follow the *frozen toolchain* part described there, since CMake checks for the build dependencies installed

				   in the system, not in the container image provided by the toolchain.

				3. In CLion, select `File`→`Open` and select the main ScyllaDB directory in order to open the CMake project there. The

				   project should open and fail to process the `CMakeLists.txt`. That's expected.

				4. In CLion, open `File`→`Settings`.

				5. Find and click on `Toolchains` (type *toolchains* into search box).

				6. Select the toolchain you will use, for instance the `Default` one.

				7. Type in the following system-installed tools to be used:

				    - `CMake`: *cmake*

				    - `Build Tool`: *ninja*

				    - `C Compiler`: *clang*

				    - `C++ Compiler`: *clang*

				8. On the `CMake` panel/tab, click on `Reload CMake Project`

				After that, CLion should successfully initialize the CMake project (marked by `[Finished]` in the console) and the

				source code editor should provide code analysis support normally from now on.

				### Distributed compilation: `distcc` and `ccache`

				Scylla's compilations times can be long. Two tools help somewhat:

				- [ccache](https://ccache.samba.org/) caches compiled object files on disk and re-uses them when possible

				- [ccache](https://ccache.samba.org/) caches compiled object files on disk and reuses them when possible

				- [distcc](https://github.com/distcc/distcc) distributes compilation jobs to remote machines

				A reasonably-powered laptop acts as the coordinator for compilation. A second, more powerful, machine acts as a passive compilation server.

				@@ -356,7 +361,7 @@ avoid that the gold linker can be told to create an index with

				More info at https://gcc.gnu.org/wiki/DebugFission.

				Both options can be enable by passing `--split-dwarf` to configure.py.

				Both options can be enabled by passing `--split-dwarf` to configure.py.

				Note that distcc is *not* compatible with it, but icecream

				(https://github.com/icecc/icecream) is.

				@@ -365,7 +370,7 @@ Note that distcc is *not* compatible with it, but icecream

				Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.

				One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:

				One way to do this is to create a local remote for the Seastar submodule in the Scylla repository:

				```bash

				$ cd $HOME/src/scylla

									
										2

LICENSE-ScyllaDB-Source-Available.md
									
												
				@@ -49,7 +49,7 @@ The terms "**You**" or "**Licensee**" refer to any individual accessing or using

				* **Ownership:** Licensor retains sole and exclusive ownership of all rights, interests and title in the Software and any scripts, processes, techniques, methodologies, inventions, know-how, concepts, formatting, arrangements, visual attributes, ideas, database rights, copyrights, patents, trade secrets, and other intellectual property related thereto, and all derivatives, enhancements, modifications and improvements thereof. Except for the limited license rights granted herein, Licensee has no rights in or to the Software and/ or Licensor’s trademarks, logo, or branding and You acknowledge that such Software, trademarks, logo, or branding is the sole property of Licensor.

				* **Feedback:** Licensee is not required to provide any suggestions, enhancement requests, recommendations or other feedback regarding the Software ("Feedback").  If, notwithstanding this policy, Licensee submits Feedback, Licensee understands and acknowledges that such Feedback is not submitted in confidence and Licensor assumes no obligation, expressed or implied, by considering it.  All right in any trademark or logo of Licensor or its affiliates and You shall make no claim of right to the Software or any part thereof to be supplied by Licensor hereunder and acknowledges that as between Licensor and You, such Software is the sole proprietary, title and interest in and to Licensor.such Feedback shall be assigned to, and shall become the sole and exclusive property of, Licensor upon its creation.

				* Except for the rights expressly granted to You under this Agreement, You are not granted any other licenses or rights in the Software or otherwise. This Agreement constitutes the entire agreement between the You and the Licensor with respect to the subject matter hereof and supersedes all prior or contemporaneous communications, representations, or agreements, whether oral or written.

				* Except for the rights expressly granted to You under this Agreement, You are not granted any other licenses or rights in the Software or otherwise. This Agreement constitutes the entire agreement between You and the Licensor with respect to the subject matter hereof and supersedes all prior or contemporaneous communications, representations, or agreements, whether oral or written.

				* **Third-Party Software:** Customer acknowledges that the Software may contain open and closed source components (“OSS Components”) that are governed separately by certain licenses, in each case as further provided by Company upon request. Any applicable OSS Component license is solely between Licensee and the applicable licensor of the OSS Component and Licensee shall comply with the applicable OSS Component license.

				* If any provision of this Agreement is held to be invalid or unenforceable, such provision shall be struck and the remaining provisions shall remain in full force and effect.

3

NOTICE.txt


@@ -1,9 +1,6 @@
 This project includes code developed by the Apache Software Foundation (http://www.apache.org/),
 especially Apache Cassandra.
 It includes files from https://github.com/antonblanchard/crc32-vpmsum (author Anton Blanchard <anton@au.ibm.com>, IBM).
 These files are located in utils/arch/powerpc/crc32-vpmsum. Their license may be found in licenses/LICENSE-crc32-vpmsum.TXT.
 It includes modified code from https://gitbox.apache.org/repos/asf?p=cassandra-dtest.git (owned by The Apache Software Foundation)
 It includes modified tests from https://github.com/etcd-io/etcd.git (owned by The etcd Authors)

									
										4

README.md
									
												
				@@ -18,7 +18,7 @@ Scylla is fairly fussy about its build environment, requiring very recent

				versions of the C++23 compiler and of many libraries to build. The document

				[HACKING.md](HACKING.md) includes detailed information on building and

				developing Scylla, but to get Scylla building quickly on (almost) any build

				machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md),

				machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md).

				This is a pre-configured Docker image which includes recent versions of all

				the required compilers, libraries and build tools. Using the frozen toolchain

				allows you to avoid changing anything in your build machine to meet Scylla's

				@@ -102,7 +102,7 @@ If you are a developer working on Scylla, please read the [developer guidelines]

				## Contact

				* The [community forum] and [Slack channel] are for users to discuss configuration, management, and operations of the ScyllaDB open source.

				* The [community forum] and [Slack channel] are for users to discuss configuration, management, and operations of ScyllaDB.

				* The [developers mailing list] is for developers and people interested in following the development of ScyllaDB to discuss technical topics.

				[Community forum]: https://forum.scylladb.com/

2

SCYLLA-VERSION-GEN


@@ -78,7 +78,7 @@ fi
 # Default scylla product/version tags
 PRODUCT=scylla
 VERSION=6.3.0-dev
 VERSION=2026.1.0-dev
 if test -f version
 then

									
										4

alternator/CMakeLists.txt
									
												
				@@ -17,6 +17,7 @@ target_sources(alternator

				    streams.cc

				    consumed_capacity.cc

				    ttl.cc

				    parsed_expression_cache.cc

				    ${cql_grammar_srcs})

				target_include_directories(alternator

				  PUBLIC

				@@ -33,5 +34,8 @@ target_link_libraries(alternator

				    idl

				    absl::headers)

				if (Scylla_USE_PRECOMPILED_HEADER_USE)

				  target_precompile_headers(alternator REUSE_FROM scylla-precompiled-header)

				endif()

				check_headers(check-headers alternator

				  GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

									
										1

alternator/auth.cc
									
												
				@@ -11,7 +11,6 @@

				#include "utils/log.hh"

				#include <string>

				#include <string_view>

				#include "bytes.hh"

				#include "alternator/auth.hh"

				#include <fmt/format.h>

				#include "auth/password_authenticator.hh"

									
										11

alternator/consumed_capacity.cc
									
												
				@@ -24,7 +24,7 @@ static constexpr uint64_t KB = 1024ULL;

				static constexpr uint64_t RCU_BLOCK_SIZE_LENGTH = 4*KB;

				static constexpr uint64_t WCU_BLOCK_SIZE_LENGTH = 1*KB;

				static bool should_add_capacity(const rjson::value& request) {

				bool consumed_capacity_counter::should_add_capacity(const rjson::value& request) {

				    const rjson::value* return_consumed = rjson::find(request, "ReturnConsumedCapacity");

				    if (!return_consumed) {

				        return false;

				@@ -62,15 +62,22 @@ static uint64_t calculate_half_units(uint64_t unit_block_size, uint64_t total_by

				rcu_consumed_capacity_counter::rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum) :

				        consumed_capacity_counter(should_add_capacity(request)),_is_quorum(is_quorum) {

				}

				uint64_t rcu_consumed_capacity_counter::get_half_units(uint64_t total_bytes, bool is_quorum) noexcept {

				    return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, total_bytes, is_quorum);

				}

				uint64_t rcu_consumed_capacity_counter::get_half_units() const noexcept {

				    return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, _total_bytes, _is_quorum);

				    return get_half_units(_total_bytes, _is_quorum);

				}

				uint64_t wcu_consumed_capacity_counter::get_half_units() const noexcept {

				    return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, _total_bytes, true);

				}

				uint64_t wcu_consumed_capacity_counter::get_units(uint64_t total_bytes) noexcept {

				    return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, total_bytes, true) * HALF_UNIT_MULTIPLIER;

				}

				wcu_consumed_capacity_counter::wcu_consumed_capacity_counter(const rjson::value& request) :

				        consumed_capacity_counter(should_add_capacity(request)) {

				}
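The hunk above makes the instance `get_half_units()` forward to a new static overload, so callers can compute a cost without constructing a counter. A minimal sketch of that delegation follows. The body of `calculate_half_units` is not shown in this diff, so the rounding rule here is an assumption (round up to whole blocks, with a quorum read costing two half-units per block), and all `*_sketch` names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the constants in consumed_capacity.cc.
constexpr uint64_t KB = 1024ULL;
constexpr uint64_t RCU_BLOCK = 4 * KB;

// ASSUMPTION: round up to whole blocks; a quorum read costs two half-units
// per block, a non-quorum read costs one. The real helper may differ.
constexpr uint64_t half_units_sketch(uint64_t block_size, uint64_t total_bytes, bool is_quorum) {
    const uint64_t blocks = (total_bytes + block_size - 1) / block_size;
    return blocks * (is_quorum ? 2 : 1);
}

class rcu_counter_sketch {
    uint64_t _total_bytes;
    bool _is_quorum;
public:
    rcu_counter_sketch(uint64_t total_bytes, bool is_quorum)
        : _total_bytes(total_bytes), _is_quorum(is_quorum) {}

    // Static overload: usable without a counter instance.
    static uint64_t get_half_units(uint64_t total_bytes, bool is_quorum) noexcept {
        return half_units_sketch(RCU_BLOCK, total_bytes, is_quorum);
    }

    // Instance overload forwards to the static one, mirroring the hunk above.
    uint64_t get_half_units() const noexcept {
        return get_half_units(_total_bytes, _is_quorum);
    }
};
```

Keeping one implementation behind both overloads means the instance path and the static path cannot drift apart.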

									
										6

alternator/consumed_capacity.hh
									
												
				@@ -42,21 +42,25 @@ public:

				     */

				    virtual uint64_t get_half_units() const noexcept = 0;

				    uint64_t _total_bytes = 0;

				    static bool should_add_capacity(const rjson::value& request);

				protected:

				    bool _should_add_to_reponse = false;

				};

				class rcu_consumed_capacity_counter : public consumed_capacity_counter {

				    virtual uint64_t get_half_units() const noexcept;

				    bool _is_quorum = false;

				public:

				    rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum);

				    rcu_consumed_capacity_counter(): consumed_capacity_counter(false), _is_quorum(false){}

				    virtual uint64_t get_half_units() const noexcept;

				    static uint64_t get_half_units(uint64_t total_bytes, bool is_quorum) noexcept;

				};

				class wcu_consumed_capacity_counter : public consumed_capacity_counter {

				    virtual uint64_t get_half_units() const noexcept;

				public:

				    wcu_consumed_capacity_counter(const rjson::value& request);

				    static uint64_t get_units(uint64_t total_bytes) noexcept;

				};

				}

									
										8

alternator/controller.cc
									
												
				@@ -6,7 +6,9 @@

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include <seastar/core/with_scheduling_group.hh>

				#include <seastar/net/dns.hh>

				#include "controller.hh"

				#include "server.hh"

				#include "executor.hh"

				@@ -134,6 +136,8 @@ future<> controller::start_server() {

				                [this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {

				            return server.init(addr, alternator_port, alternator_https_port, creds,

				                    _config.alternator_enforce_authorization,

				                    _config.alternator_warn_authorization,

				                    _config.alternator_max_users_query_size_in_trace_output,

				                    &_memory_limiter.local().get_semaphore(),

				                    _config.max_concurrent_requests_per_shard);

				        }).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {

				@@ -165,4 +169,8 @@ future<> controller::request_stop_server() {

				    });

				}

				future<utils::chunked_vector<client_data>> controller::get_client_data() {

				    return _server.local().get_client_data();

				}

				}

									
										6

alternator/controller.hh
									
												
				@@ -11,7 +11,7 @@

				#include <seastar/core/sharded.hh>

				#include <seastar/core/smp.hh>

				#include "protocol_server.hh"

				#include "transport/protocol_server.hh"

				namespace service {

				class storage_proxy;

				@@ -90,6 +90,10 @@ public:

				    virtual future<> start_server() override;

				    virtual future<> stop_server() override;

				    virtual future<> request_stop_server() override;

				    // This virtual function is called (on each shard separately) when the

				    // virtual table "system.clients" is read. It is expected to generate a

				    // list of clients connected to this server (on this shard).

				    virtual future<utils::chunked_vector<client_data>> get_client_data() override;

				};

				}

									
										6

alternator/error.hh
									
												
				@@ -88,9 +88,15 @@ public:

				    static api_error table_not_found(std::string msg) {

				        return api_error("TableNotFoundException", std::move(msg));

				    }

				    static api_error limit_exceeded(std::string msg) {

				        return api_error("LimitExceededException", std::move(msg));

				    }

				    static api_error internal(std::string msg) {

				        return api_error("InternalServerError", std::move(msg), http::reply::status_type::internal_server_error);

				    }

				    static api_error payload_too_large(std::string msg) {

				        return api_error("PayloadTooLarge", std::move(msg), status_type::payload_too_large);

				    }

				    // Provide the "std::exception" interface, to make it easier to print this

				    // exception in log messages. Note that this function is *not* used to

2862

alternator/executor.cc


File diff suppressed because it is too large.

									
										83

alternator/executor.hh
									
												
				@@ -10,8 +10,8 @@

				#include <seastar/core/future.hh>

				#include "seastarx.hh"

				#include <seastar/json/json_elements.hh>

				#include <seastar/core/sharded.hh>

				#include <seastar/util/noncopyable_function.hh>

				#include "service/migration_manager.hh"

				#include "service/client_state.hh"

				@@ -58,33 +58,6 @@ namespace alternator {

				class rmw_operation;

				struct make_jsonable : public json::jsonable {

				    rjson::value _value;

				public:

				    explicit make_jsonable(rjson::value&& value);

				    std::string to_json() const override;

				};

				/**

				 * Make return type for serializing the object "streamed",

				 * i.e. direct to HTTP output stream. Note: only useful for

				 * (very) large objects as there are overhead issues with this

				 * as well, but for massive lists of return objects this can

				 * help avoid large allocations/many re-allocs

				 */

				json::json_return_type make_streamed(rjson::value&&);

				struct json_string : public json::jsonable {

				    std::string _value;

				public:

				    explicit json_string(std::string&& value);

				    std::string to_json() const override;

				};

				namespace parsed {

				class path;

				};

				schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);

				bool is_alternator_keyspace(const sstring& ks_name);

				// Wraps the db::get_tags_of_table and throws if the table is missing the tags extension.

				@@ -155,6 +128,9 @@ using attrs_to_get_node = attribute_path_map_node<std::monostate>;

				// optional means we should get all attributes, not specific ones.

				using attrs_to_get = attribute_path_map<std::monostate>;

				namespace parsed {

				class expression_cache;

				}

				class executor : public peering_sharded_service<executor> {

				    gms::gossiper& _gossiper;

				@@ -163,14 +139,32 @@ class executor : public peering_sharded_service<executor> {

				    db::system_distributed_keyspace& _sdks;

				    cdc::metadata& _cdc_metadata;

				    utils::updateable_value<bool> _enforce_authorization;

				    utils::updateable_value<bool> _warn_authorization;

				    // An smp_service_group to be used for limiting the concurrency when

				    // forwarding Alternator request between shards - if necessary for LWT.

				    smp_service_group _ssg;

				    std::unique_ptr<parsed::expression_cache> _parsed_expression_cache;

				public:

				    using client_state = service::client_state;

				    using request_return_type = std::variant<json::json_return_type, api_error>;

				    // request_return_type is the return type of the executor methods, which

				    // can be one of:

				    // 1. A string, which is the response body for the request.

				    // 2. A body_writer, an asynchronous function (returning future<>) that

				    //    takes an output_stream and writes the response body into it.

				    // 3. An api_error, which is an error response that should be returned to

				    //    the client.

				    // The body_writer is used for streaming responses, where the response body

				    // is written in chunks to the output_stream. This allows for efficient

				    // handling of large responses without needing to allocate a large buffer

				    // in memory.

				    using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;

				    using request_return_type = std::variant<std::string, body_writer, api_error>;

				    stats _stats;

				    // The metric_groups object holds this stat object's metrics registered

				    // as long as the stats object is alive.

				    seastar::metrics::metric_groups _metrics;

				    static constexpr auto ATTRS_COLUMN_NAME = ":attrs";

				    static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";

				    static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";

				@@ -182,6 +176,7 @@ public:

				             cdc::metadata& cdc_metadata,

				             smp_service_group ssg,

				             utils::updateable_value<uint32_t> default_timeout_in_ms);

				    ~executor();

				    future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);

				    future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);

				@@ -209,26 +204,23 @@ public:

				    future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request);

				    future<> start();

				    future<> stop() {

				        // disconnect from the value source, but keep the value unchanged.

				        s_default_timeout_in_ms = utils::updateable_value<uint32_t>{s_default_timeout_in_ms()};

				        return make_ready_future<>();

				    }

				    future<> stop();

				    static sstring table_name(const schema&);

				    static db::timeout_clock::time_point default_timeout();

				private:

				    static thread_local utils::updateable_value<uint32_t> s_default_timeout_in_ms;

				public:

				    static schema_ptr find_table(service::storage_proxy&, std::string_view table_name);

				    static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);

				private:

				    friend class rmw_operation;

				    static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr);

				    static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);

				public:

				    static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);

				    static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);

				    static std::optional<rjson::value> describe_single_item(schema_ptr,

				        const query::partition_slice&,

				@@ -237,11 +229,15 @@ public:

				        const std::optional<attrs_to_get>&,

				        uint64_t* = nullptr);

				    // Converts a multi-row selection result to JSON compatible with DynamoDB.

				    // For each row, this method calls item_callback, which takes the size of

				    // the item as the parameter.

				    static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,

				        const query::partition_slice&& slice,

				        shared_ptr<cql3::selection::selection> selection,

				        foreign_ptr<lw_shared_ptr<query::result>> query_result,

				        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);

				        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,

				        noncopyable_function<void(uint64_t)> item_callback = {});

				    static void describe_single_item(const cql3::selection::selection&,

				        const std::vector<managed_bytes_opt>&,

				@@ -250,7 +246,7 @@ public:

				        uint64_t* item_length_in_bytes = nullptr,

				        bool = false);

				    static void add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);

				    static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);

				    static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);

				    static void supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp);

				};

				@@ -269,6 +265,15 @@ bool is_big(const rjson::value& val, int big_size = 100'000);

				// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,

				// SELECT, DROP, etc.) on the given table. When permission is denied an

				// appropriate user-readable api_error::access_denied is thrown.

				future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);

				future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, alternator::stats& stats);

				/**

				 * Make return type for serializing the object "streamed",

				 * i.e. direct to HTTP output stream. Note: only useful for

				 * (very) large objects as there are overhead issues with this

				 * as well, but for massive lists of return objects this can

				 * help avoid large allocations/many re-allocs

				 */

				executor::body_writer make_streamed(rjson::value&&);

				}
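The new `request_return_type` comment in `executor.hh` describes three alternatives: a ready string, a `body_writer` that streams the body, or an `api_error`. The sketch below shows how a caller might dispatch on such a variant. It is illustrative only: a synchronous `std::function` stands in for Seastar's `noncopyable_function<future<>(output_stream<char>&&)>`, `api_error_sketch` stands in for `alternator::api_error`, and the JSON error layout is made up for the example.

```cpp
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Stand-ins (assumptions), not the real Seastar/Alternator types.
using body_writer_sketch = std::function<void(std::ostream&)>;
struct api_error_sketch { std::string type; std::string msg; };
using return_type_sketch = std::variant<std::string, body_writer_sketch, api_error_sketch>;

// Write whichever alternative the variant holds into the output stream.
void send_reply(return_type_sketch&& r, std::ostream& out) {
    std::visit([&out](auto&& v) {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, std::string>) {
            out << v;   // the whole body is already in memory
        } else if constexpr (std::is_same_v<T, body_writer_sketch>) {
            v(out);     // the writer emits the body piece by piece
        } else {
            // Illustrative error layout only.
            out << "{\"__type\":\"" << v.type << "\",\"message\":\"" << v.msg << "\"}";
        }
    }, std::move(r));
}

// Convenience wrapper for the examples: collect the reply into a string.
std::string render(return_type_sketch r) {
    std::ostringstream out;
    send_reply(std::move(r), out);
    return out.str();
}
```

With this shape, a handler that returns a large result can hand back a writer and never build the full body as a single string.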

									
										24

alternator/expressions.cc
									
												
				@@ -165,7 +165,9 @@ static std::optional<std::string> resolve_path_component(const std::string& colu

				                    fmt::format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));

				        }

				        used_attribute_names.emplace(column_name);

				        return std::string(rjson::to_string_view(*value));

				        auto result = std::string(rjson::to_string_view(*value));

				        validate_attr_name_length("", result.size(), false, "ExpressionAttributeNames contains invalid value: ");

				        return result;

				    }

				    return std::nullopt;

				}

				@@ -737,6 +739,26 @@ rjson::value calculate_value(const parsed::set_rhs& rhs,

				    return rjson::null_value();

				}

				void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix) {

				    constexpr const size_t DYNAMODB_KEY_ATTR_NAME_SIZE_MAX = 255;

				    constexpr const size_t DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX = 65535;

				    const size_t max_length = is_key ? DYNAMODB_KEY_ATTR_NAME_SIZE_MAX : DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX;

				    if (attr_name_length > max_length) {

				        std::string error_msg;

				        if (!error_msg_prefix.empty()) {

				            error_msg += error_msg_prefix;

				        }

				        if (!supplementary_context.empty()) {

				            error_msg += "in ";

				            error_msg += supplementary_context;

				            error_msg += " - ";

				        }

				        error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));

				        throw api_error::validation(error_msg);

				    }

				}

				} // namespace alternator

				auto fmt::formatter<alternator::parsed::path>::format(const alternator::parsed::path& p, fmt::format_context& ctx) const
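The length check added in this file can be tried in isolation. The sketch below mirrors the logic of `validate_attr_name_length` from the hunk above, under two assumptions: `std::invalid_argument` replaces `api_error::validation`, and `std::to_string` replaces `fmt::format`. The `check_message` helper and the `"KeySchema"` context string used in the usage note are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// ASSUMPTION: std::invalid_argument stands in for api_error::validation.
void validate_attr_name_length_sketch(std::string_view context, std::size_t len,
                                      bool is_key, std::string_view prefix = {}) {
    constexpr std::size_t key_max = 255;       // limit for key attribute names
    constexpr std::size_t nonkey_max = 65535;  // limit for all other attribute names
    const std::size_t max_length = is_key ? key_max : nonkey_max;
    if (len <= max_length) {
        return;
    }
    std::string msg(prefix);
    if (!context.empty()) {
        msg += "in ";
        msg += context;
        msg += " - ";
    }
    msg += "Attribute name is too large, must be less than "
         + std::to_string(max_length + 1) + " bytes";
    throw std::invalid_argument(msg);
}

// Helper for the examples: returns the error text, or "" if the name is accepted.
std::string check_message(std::string_view context, std::size_t len, bool is_key,
                          std::string_view prefix = {}) {
    try {
        validate_attr_name_length_sketch(context, len, is_key, prefix);
        return "";
    } catch (const std::invalid_argument& e) {
        return e.what();
    }
}
```

For example, `check_message("KeySchema", 256, true, "bad: ")` yields `bad: in KeySchema - Attribute name is too large, must be less than 256 bytes`, while a 255-byte key name passes.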

29

alternator/expressions.g


@@ -91,6 +91,18 @@ options {
         throw expressions_syntax_error(format("{} at char {}", err,
             ex->get_charPositionInLine()));
     }
     // ANTLR3 tries to recover missing tokens - it tries to finish parsing
     // and create valid objects, as if the missing token was there.
     // But it has a bug and leaks these tokens.
     // We override offending method and handle abandoned pointers.
     std::vector<std::unique_ptr<TokenType>> _missing_tokens;
     TokenType* getMissingSymbol(IntStreamType* istream, ExceptionBaseType* e,
                                 ANTLR_UINT32 expectedTokenType, BitsetListType* follow) {
         auto token = BaseType::getMissingSymbol(istream, e, expectedTokenType, follow);
         _missing_tokens.emplace_back(token);
         return token;
     }
 }
 @lexer::context {
     void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {
@@ -184,7 +196,13 @@ path_component: NAME | NAMEREF;
 path returns [parsed::path p]:
     root=path_component           { $p.set_root($root.text); }
     (   '.' name=path_component   { $p.add_dot($name.text); }
       | '[' INTEGER ']'           { $p.add_index(std::stoi($INTEGER.text)); }
       | '[' INTEGER ']'           {
                 try {
                     $p.add_index(std::stoi($INTEGER.text));
                 } catch(std::out_of_range&) {
                     throw expressions_syntax_error("list index out of integer range");
                 }
             }
     )*;
 /* See comment above why the "depth" counter was needed here */
@@ -230,7 +248,7 @@ update_expression_clause returns [parsed::update_expression e]:
 // Note the "EOF" token at the end of the update expression. We want to the
 //  parser to match the entire string given to it - not just its beginning!
 update_expression returns [parsed::update_expression e]:
     (update_expression_clause { e.append($update_expression_clause.e); })* EOF;
     (update_expression_clause { e.append($update_expression_clause.e); })+ EOF;
 projection_expression returns [std::vector<parsed::path> v]:
     p=path      { $v.push_back(std::move($p.p)); }
@@ -257,6 +275,13 @@ primitive_condition returns [parsed::primitive_condition c]:
          (',' v=value[0] { $c.add_value(std::move($v.v)); })*
          ')'
       )?
       {
           // Post-parse check to reject non-function single values
           if ($c._op == parsed::primitive_condition::type::VALUE &&
               !$c._values.front().is_func()) {
               throw expressions_syntax_error("Single value must be a function");
           }
       }
     ;
 // The following rules for parsing boolean expressions are verbose and
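The `path` rule above now wraps `std::stoi` because that function throws `std::out_of_range` when the digits do not fit in an `int`. A minimal standalone sketch of the same pattern follows, with `std::runtime_error` assumed as a stand-in for `expressions_syntax_error`; the function names are hypothetical.

```cpp
#include <stdexcept>
#include <string>

// ASSUMPTION: std::runtime_error stands in for expressions_syntax_error.
int parse_list_index(const std::string& digits) {
    try {
        return std::stoi(digits);
    } catch (const std::out_of_range&) {
        // Translate the library exception into the parser's own error type.
        throw std::runtime_error("list index out of integer range");
    }
}

// Helper for the examples: true if the index text is rejected as out of range.
bool index_is_rejected(const std::string& digits) {
    try {
        parse_list_index(digits);
        return false;
    } catch (const std::runtime_error&) {
        return true;
    }
}
```

Without the translation, the `std::out_of_range` would escape the grammar action as a non-syntax exception instead of a user-facing syntax error.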

									
										24

alternator/expressions.hh
									
												
				@@ -18,6 +18,8 @@

				#include "expressions_types.hh"

				#include "utils/rjson.hh"

				#include "utils/updateable_value.hh"

				#include "stats.hh"

				namespace alternator {

				@@ -26,6 +28,26 @@ public:

				    using runtime_error::runtime_error;

				};

				namespace parsed {

				class expression_cache_impl;

				class expression_cache {

				    std::unique_ptr<expression_cache_impl> _impl;

				public:

				    struct config {

				        utils::updateable_value<uint32_t> max_cache_entries;

				    };

				    expression_cache(config cfg, stats& stats);

				    ~expression_cache();

				    // stop background tasks, if any

				    future<> stop();

				    update_expression parse_update_expression(std::string_view query);

				    std::vector<path> parse_projection_expression(std::string_view query);

				    condition_expression parse_condition_expression(std::string_view query, const char* caller);

				};

				} // namespace parsed

				// Preferably use parsed::expression_cache instance instead of these free functions.

				parsed::update_expression parse_update_expression(std::string_view query);

				std::vector<parsed::path> parse_projection_expression(std::string_view query);

				parsed::condition_expression parse_condition_expression(std::string_view query, const char* caller);

				@@ -91,5 +113,7 @@ rjson::value calculate_value(const parsed::value& v,

				rjson::value calculate_value(const parsed::set_rhs& rhs,

				        const rjson::value* previous_item);

				void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix = {});

				} /* namespace alternator */

									
										4

alternator/expressions_types.hh
									
				@@ -209,9 +209,7 @@ public:

				//    function is supported).

				// 2. Ternary operator - v1 BETWEEN v2 and v3 (means v1 >= v2 AND v1 <= v3).

				// 3. N-ary operator - v1 IN ( v2, v3, ... )

				// 4. A single function call (attribute_exists etc.). The parser actually

				//    accepts a more general "value" here but later stages reject a value

				//    which is not a function call (because DynamoDB does it too).

				// 4. A single function call (attribute_exists etc.).

				class primitive_condition {

				public:

				    enum class type {

									
										73

alternator/extract_from_attrs.hh
									
										Normal file
									
				@@ -0,0 +1,73 @@

				/*

				 * Copyright 2024-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#pragma once

				#include <string>

				#include <string_view>

				#include "utils/rjson.hh"

				#include "serialization.hh"

				#include "schema/column_computation.hh"

				#include "db/view/regular_column_transformation.hh"

				namespace alternator {

				// An implementation of a "column_computation" which extracts a specific

				// non-key attribute from the big map (":attrs") of all non-key attributes,

				// and deserializes it if it has the desired type. GSI will use this computed

				// column as a materialized-view key when the view key attribute isn't a

				// full-fledged CQL column but rather stored in ":attrs".

				class extract_from_attrs_column_computation : public regular_column_transformation {

				    // The name of the CQL column holding the attribute map. It is a

				    // constant defined in executor.cc (as ":attrs"), so doesn't need

				    // to be specified when constructing the column computation.

				    static const bytes MAP_NAME;

				    // The top-level attribute name to extract from the ":attrs" map.

				    bytes _attr_name;

				    // The type we expect for the value stored in the attribute. If the type

				    // matches the expected type, it is decoded (from the serialized format

				    // we store in the map's values) into the raw CQL type value that we use

				    // for keys, and returned by compute_value(). Only the types "S" (string),

				    // "B" (bytes) and "N" (number) are allowed as keys in DynamoDB, and

				    // therefore in desired_type.

				    alternator_type _desired_type;

				public:

				    virtual column_computation_ptr clone() const override;

				    // TYPE_NAME is a unique string that distinguishes this class from other

				    // column_computation subclasses. column_computation::deserialize() will

				    // construct an object of this subclass if it sees a "type" TYPE_NAME.

				    static inline const std::string TYPE_NAME = "alternator_extract_from_attrs";

				    // Serialize the *definition* of this column computation into a JSON

				    // string with a unique "type" string - TYPE_NAME - which then causes

				    // column_computation::deserialize() to create an object from this class.

				    virtual bytes serialize() const override;

				    // Construct this object based on the previous output of serialize().

				    // Calls on_internal_error() if the string doesn't match the output format

				    // of serialize(). "type" is not checked, because column_computation::deserialize()

				    // won't call this constructor if "type" doesn't match.

				    extract_from_attrs_column_computation(const rjson::value &v);

				    extract_from_attrs_column_computation(bytes_view attr_name, alternator_type desired_type)

				        : _attr_name(attr_name), _desired_type(desired_type)

				        {}

				    // Implement regular_column_transformation's compute_value() that

				    // accepts the full row:

				    result compute_value(const schema& schema, const partition_key& key,

				        const db::view::clustering_or_static_row& row) const override;

				    // But do not implement column_computation's compute_value() that

				    // accepts only a partition key - that's not enough so our implementation

				    // of this function does on_internal_error().

				    bytes compute_value(const schema& schema, const partition_key& key) const override;

				    // This computed column does depend on a non-primary key column, so

				    // its result may change in the update and we need to compute it

				    // before and after the update.

				    virtual bool depends_on_non_primary_key_column() const override {

				        return true;

				    }

				};

				} // namespace alternator
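The header comments above describe extracting one attribute from the ":attrs" map and returning it only when its type tag matches the desired key type. A minimal sketch of that idea follows, using only the standard library. The names `attr_type`, `attrs_map` and `extract_if_type`, and the tag values 'S', 'B' and 'N', are assumptions made for illustration; they are not the types or the encoding used by the real `compute_value()`.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for alternator_type; the tag values are illustrative only.
enum class attr_type : char { S = 'S', B = 'B', N = 'N' };

// Hypothetical stand-in for the ":attrs" map: attribute name -> serialized value,
// where the first byte is a type tag and the remaining bytes are the raw value.
using attrs_map = std::map<std::string, std::string, std::less<>>;

// Return the raw value of `name` if the attribute exists and carries the
// `desired` type tag; otherwise return an empty optional.
std::optional<std::string> extract_if_type(const attrs_map& attrs,
                                           std::string_view name,
                                           attr_type desired) {
    auto it = attrs.find(name);
    if (it == attrs.end()) {
        return std::nullopt;            // attribute is not set at all
    }
    std::string_view v = it->second;
    if (v.empty() || attr_type(v[0]) != desired) {
        return std::nullopt;            // set, but with a different type
    }
    v.remove_prefix(1);                 // drop the type tag, keep the raw value
    return std::string(v);
}
```

In this sketch a missing attribute and an attribute of the wrong type are treated the same way: both yield an empty result, so neither would produce a key.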

									
										109

alternator/parsed_expression_cache.cc
									
										Normal file
									
				@@ -0,0 +1,109 @@

				/*

				 * Copyright 2025-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include "expressions.hh"

				#include "utils/log.hh"

				#include "utils/lru_string_map.hh"

				#include <variant>

				static logging::logger logger_("parsed-expression-cache");

				namespace alternator::parsed {

				struct expression_cache_impl {

				    stats& _stats;

				    using cached_expressions_types = std::variant<

				        update_expression,

				        condition_expression,

				        std::vector<path>

				    >;

				    sized_lru_string_map<cached_expressions_types> _cached_entries;

				    utils::observable<uint32_t>::observer _max_cache_entries_observer;

				    expression_cache_impl(expression_cache::config cfg, stats& stats);

				    // to define the specialized return type of `get_or_create()`

				    template <typename Func, typename... Args>

				    using ParseResult = std::invoke_result_t<Func, std::string_view, Args...>;

				    // Caching layer for parsed expressions

				    // The expression type is determined by the type of the parsing function passed as a parameter,

				    // and the return type is exactly the same as the return type of this parsing function.

				    // StatsType is used only to update appropriate statistics - currently it is aligned with the expression type,

				    // but it could be extended in the future if needed, e.g. split per operation.

				    template <stats::expression_types StatsType, typename Func, typename... Args>

				    ParseResult<Func, Args...> get_or_create(std::string_view query, Func&& parse_func, Args&&... other_args) {

				        if (_cached_entries.disabled()) {

				            return parse_func(query, std::forward<Args>(other_args)...);

				        }

				        if (!_cached_entries.sanity_check()) {

				            _stats.expression_cache.requests[StatsType].misses++;

				            return parse_func(query, std::forward<Args>(other_args)...);

				        }

				        auto value = _cached_entries.find(query);

				        if (value) {

				            logger_.trace("Cache hit for query: {}", query);

				            _stats.expression_cache.requests[StatsType].hits++;

				            try {

				                return std::get<ParseResult<Func, Args...>>(value->get());

				            } catch (const std::bad_variant_access&) {

				                // User can reach this code, by sending the same query string as a different expression type.

				                // In practice valid queries are different enough to not collide.

				                // Entries in cache are only valid queries.

				                // This request will fail at parsing below.

				                // If, by any chance this is a valid query, it will be updated below with the new value.

				                logger_.trace("Cache hit for '{}', but type mismatch.", query);

				                _stats.expression_cache.requests[StatsType].hits--;

				            }

				        } else {

				            logger_.trace("Cache miss for query: {}", query);

				        }

				        ParseResult<Func, Args...> expr = parse_func(query, std::forward<Args>(other_args)...);

				        // Invalid query will throw here ^

				        _stats.expression_cache.requests[StatsType].misses++;

				        if (value) [[unlikely]] {

				            value->get() = cached_expressions_types{expr};

				        } else {

				            _cached_entries.insert(query, cached_expressions_types{expr});

				        }

				        return expr;

				    }

				};

				expression_cache_impl::expression_cache_impl(expression_cache::config cfg, stats& stats) : 

				    _stats(stats), _cached_entries(logger_, _stats.expression_cache.evictions),

				    _max_cache_entries_observer(cfg.max_cache_entries.observe([this] (uint32_t max_value) {

				        _cached_entries.set_max_size(max_value);

				    })) {

				    _cached_entries.set_max_size(cfg.max_cache_entries());

				}

				expression_cache::expression_cache(expression_cache::config cfg, stats& stats) : 

				    _impl(std::make_unique<expression_cache_impl>(std::move(cfg), stats)) {

				}

				expression_cache::~expression_cache() = default;

				future<> expression_cache::stop() {

				    return _impl->_cached_entries.stop();

				}

				update_expression expression_cache::parse_update_expression(std::string_view query) {

				    return _impl->get_or_create<stats::expression_types::UPDATE_EXPRESSION>(query, alternator::parse_update_expression);

				}

				std::vector<path> expression_cache::parse_projection_expression(std::string_view query) {

				    return _impl->get_or_create<stats::expression_types::PROJECTION_EXPRESSION>(query, alternator::parse_projection_expression);

				}

				condition_expression expression_cache::parse_condition_expression(std::string_view query, const char* caller) {

				    return _impl->get_or_create<stats::expression_types::CONDITION_EXPRESSION>(query, alternator::parse_condition_expression, caller);

				}

				} // namespace alternator::parsed
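The `get_or_create()` flow above (look up by query string, re-parse on a miss or on a type mismatch, then store the result) can be illustrated with a much smaller sketch. Everything here is an assumption for illustration: `tiny_expression_cache`, `update_expr`, `projection` and the two parse functions are hypothetical, a plain `std::unordered_map` replaces `sized_lru_string_map` (so there is no eviction), and `std::get_if` is used instead of catching `std::bad_variant_access`.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical parsed types standing in for update_expression and std::vector<path>.
struct update_expr { std::string text; };
using projection = std::vector<std::string>;

struct tiny_expression_cache {
    using entry = std::variant<update_expr, projection>;
    std::unordered_map<std::string, entry> entries;
    unsigned hits = 0;
    unsigned misses = 0;

    template <typename Result, typename ParseFunc>
    Result get_or_create(std::string_view query, ParseFunc&& parse) {
        auto it = entries.find(std::string(query));
        if (it != entries.end()) {
            if (auto* cached = std::get_if<Result>(&it->second)) {
                ++hits;
                return *cached;
            }
            // The same string is cached as a different expression type:
            // fall through and parse it again as the requested type.
        }
        Result parsed = parse(query);   // an invalid query would throw here
        ++misses;
        // Insert a new entry or overwrite the mismatching one.
        entries.insert_or_assign(std::string(query),
                                 entry(std::in_place_type<Result>, parsed));
        return parsed;
    }
};

// Trivial stand-ins for the real parsers.
inline update_expr parse_update(std::string_view q) { return update_expr{std::string(q)}; }
inline projection parse_projection(std::string_view q) { return projection{std::string(q)}; }
```

Because the parse call comes before the cache is touched, a query that fails to parse never enters the map, which matches the note above that cached entries are only valid queries.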

									
										21

alternator/rmw_operation.hh
									
				@@ -8,13 +8,16 @@

				#pragma once

				#include "cdc/cdc_options.hh"

				#include "cdc/log.hh"

				#include "seastarx.hh"

				#include "service/paxos/cas_request.hh"

				#include "service/cas_shard.hh"

				#include "utils/rjson.hh"

				#include "consumed_capacity.hh"

				#include "executor.hh"

				#include "tracing/trace_state.hh"

				#include "keys.hh"

				#include "keys/keys.hh"

				namespace alternator {

				@@ -55,7 +58,7 @@ public:

				    static write_isolation get_write_isolation_for_schema(schema_ptr schema);

				    static write_isolation default_write_isolation;

				public:

				    static void set_default_write_isolation(std::string_view mode);

				protected:

				@@ -106,21 +109,27 @@ public:

				    // violating this). We mark apply() "const" to let the compiler validate

				    // this for us. The output-only field _return_attributes is marked

				    // "mutable" above so that apply() can still write to it.

				    virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const = 0;

				    virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts, cdc::per_request_options& cdc_opts) const = 0;

				    // Convert the above apply() into the signature needed by cas_request:

				    virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override;

				    virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts, cdc::per_request_options& cdc_opts) override;

				    virtual ~rmw_operation() = default;

				    const wcu_consumed_capacity_counter& consumed_capacity() const noexcept { return _consumed_capacity; }

				    schema_ptr schema() const { return _schema; }

				    const rjson::value& request() const { return _request; }

				    rjson::value&& move_request() && { return std::move(_request); }

				    future<executor::request_return_type> execute(service::storage_proxy& proxy,

				            std::optional<service::cas_shard> cas_shard,

				            service::client_state& client_state,

				            tracing::trace_state_ptr trace_state,

				            service_permit permit,

				            bool needs_read_before_write,

				            stats& stats,

				            stats& global_stats,

				            stats& per_table_stats,

				            uint64_t& wcu_total);

				    std::optional<shard_id> shard_for_execute(bool needs_read_before_write);

				    std::optional<service::cas_shard> shard_for_execute(bool needs_read_before_write);

				private:

				    inline bool should_fill_preimage() const { return _schema->cdc_options().enabled(); }

				};

				} // namespace alternator

									
										70

alternator/serialization.cc
									
				@@ -11,8 +11,8 @@

				#include "utils/log.hh"

				#include "serialization.hh"

				#include "error.hh"

				#include "concrete_types.hh"

				#include "cql3/type_json.hh"

				#include "types/concrete_types.hh"

				#include "types/json_utils.hh"

				#include "mutation/position_in_partition.hh"

				static logging::logger slogger("alternator-serialization");

				@@ -245,6 +245,27 @@ rjson::value deserialize_item(bytes_view bv) {

				    return deserialized;

				}

				// This function takes a bytes_view created earlier by serialize_item(), and

				// if it has the type "expected_type", the function returns the value as a

				// raw Scylla type. If the type doesn't match, returns an unset optional.

				// This function only supports the key types S (string), B (bytes) and N

				// (number) - serialize_item() serializes those types as a single-byte type

				// followed by the serialized raw Scylla type, so all this function needs to

				// do is to remove the first byte. This makes this function much more

				// efficient than deserialize_item() above because it avoids transformation

				// to/from JSON.

				std::optional<bytes> serialized_value_if_type(bytes_view bv, alternator_type expected_type) {

				    if (bv.empty() || alternator_type(bv[0]) != expected_type) {

				        return std::nullopt;

				    }

				    // Currently, serialize_item() for types in alternator_type (notably S, B

				    // and N) are nothing more than Scylla's raw format for these types

				    // preceded by a type byte. So we just need to skip that byte and we are

				    // left with exactly what we need to return.

				    bv.remove_prefix(1);

				    return bytes(bv);

				}

				std::string type_to_string(data_type type) {

				    static thread_local std::unordered_map<data_type, std::string> types = {

				        {utf8_type, "S"},

				@@ -261,15 +282,23 @@ std::string type_to_string(data_type type) {

				    return it->second;

				}

				bytes get_key_column_value(const rjson::value& item, const column_definition& column) {

				std::optional<bytes> try_get_key_column_value(const rjson::value& item, const column_definition& column) {

				    std::string column_name = column.name_as_text();

				    const rjson::value* key_typed_value = rjson::find(item, column_name);

				    if (!key_typed_value) {

				        throw api_error::validation(fmt::format("Key column {} not found", column_name));

				        return std::nullopt;

				    }

				    return get_key_from_typed_value(*key_typed_value, column);

				}

				bytes get_key_column_value(const rjson::value& item, const column_definition& column) {

				    auto value = try_get_key_column_value(item, column);

				    if (!value) {

				        throw api_error::validation(fmt::format("Key column {} not found", column.name_as_text()));

				    }

				    return std::move(*value);

				}

				// Parses the JSON encoding for a key value, which is a map with a single

				// entry whose key is the type and the value is the encoded value.

				// If this type does not match the desired "type_str", an api_error::validation

				@@ -359,20 +388,38 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {

				        return clustering_key::make_empty();

				    }

				    std::vector<bytes> raw_ck;

				    // FIXME: this is a loop, but we really allow only one clustering key column.

				    // Note: it's possible to get more than one clustering column here, as

				    // Alternator can be used to read scylla internal tables.

				    for (const column_definition& cdef : schema->clustering_key_columns()) {

				        bytes raw_value = get_key_column_value(item,  cdef);

				        auto raw_value = get_key_column_value(item,  cdef);

				        raw_ck.push_back(std::move(raw_value));

				    }

				    return clustering_key::from_exploded(raw_ck);

				}

				position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {

				    auto ck = ck_from_json(item, schema);

				    if (is_alternator_keyspace(schema->ks_name())) {

				        return position_in_partition::for_key(std::move(ck));

				clustering_key_prefix ck_prefix_from_json(const rjson::value& item, schema_ptr schema) {

				    if (schema->clustering_key_size() == 0) {

				        return clustering_key_prefix::make_empty();

				    }

				    std::vector<bytes> raw_ck;

				    for (const column_definition& cdef : schema->clustering_key_columns()) {

				        auto raw_value = try_get_key_column_value(item,  cdef);

				        if (!raw_value) {

				            break;

				        }

				        raw_ck.push_back(std::move(*raw_value));

				    }

				    return clustering_key_prefix::from_exploded(raw_ck);

				}

				position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {

				    const bool is_alternator_ks = is_alternator_keyspace(schema->ks_name());

				    if (is_alternator_ks) {

				        return position_in_partition::for_key(ck_from_json(item, schema));

				    }

				    const auto region_item = rjson::find(item, scylla_paging_region);

				    const auto weight_item = rjson::find(item, scylla_paging_weight);

				    if (bool(region_item) != bool(weight_item)) {

				@@ -392,8 +439,9 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)

				        } else {

				            throw std::runtime_error(fmt::format("Invalid value for weight: {}", weight_view));

				        }

				        return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);

				        return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(ck_prefix_from_json(item, schema)) : std::nullopt);

				    }

				    auto ck = ck_from_json(item, schema);

				    if (ck.is_empty()) {

				        return position_in_partition::for_partition_start();

				    }
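The difference between `ck_from_json()` and `ck_prefix_from_json()` in the hunks above is that the prefix variant stops at the first clustering column missing from the item instead of throwing. A minimal sketch of that "collect until the first gap" loop follows. The `item` alias and the helpers `try_get` and `key_prefix` are hypothetical, and plain strings stand in for `rjson::value` and `bytes`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a JSON item: column name -> encoded value.
using item = std::map<std::string, std::string>;

// Non-throwing lookup, in the spirit of try_get_key_column_value().
std::optional<std::string> try_get(const item& it, const std::string& column) {
    auto found = it.find(column);
    if (found == it.end()) {
        return std::nullopt;
    }
    return found->second;
}

// Collect values for the clustering columns in order, stopping at the first
// column that is absent, so the result is always a prefix of the full key.
std::vector<std::string> key_prefix(const item& it,
                                    const std::vector<std::string>& clustering_columns) {
    std::vector<std::string> prefix;
    for (const auto& column : clustering_columns) {
        auto value = try_get(it, column);
        if (!value) {
            break;      // later columns are ignored even if they are present
        }
        prefix.push_back(std::move(*value));
    }
    return prefix;
}
```

With columns `c1, c2, c3` and an item that sets only `c1` and `c3`, this sketch yields a one-element prefix: `c3` is skipped because `c2` is missing.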

									
										3

alternator/serialization.hh
									
				@@ -13,7 +13,7 @@

				#include <optional>

				#include "types/types.hh"

				#include "schema/schema_fwd.hh"

				#include "keys.hh"

				#include "keys/keys.hh"

				#include "utils/rjson.hh"

				#include "utils/big_decimal.hh"

				@@ -43,6 +43,7 @@ type_representation represent_type(alternator_type atype);

				bytes serialize_item(const rjson::value& item);

				rjson::value deserialize_item(bytes_view bv);

				std::optional<bytes> serialized_value_if_type(bytes_view bv, alternator_type expected_type);

				std::string type_to_string(data_type type);

									
										486

alternator/server.cc
									
				@@ -13,7 +13,7 @@

				#include <seastar/http/function_handlers.hh>

				#include <seastar/http/short_streams.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/json/json_elements.hh>

				#include <seastar/coroutine/maybe_yield.hh>

				#include <seastar/util/defer.hh>

				#include <seastar/util/short_streams.hh>

				#include "seastarx.hh"

				@@ -31,6 +31,9 @@

				#include "gms/gossiper.hh"

				#include "utils/overloaded_functor.hh"

				#include "utils/aws_sigv4.hh"

				#include "client_data.hh"

				#include "utils/updateable_value.hh"

				#include <zlib.h>

				static logging::logger slogger("alternator-server");

				@@ -100,6 +103,13 @@ static void handle_CORS(const request& req, reply& rep, bool preflight) {

				// the user directly. Other exceptions are unexpected, and reported as

				// Internal Server Error.

				class api_handler : public handler_base {

				    // Although the DynamoDB API responses are JSON, additional

				    // conventions apply to these responses. For this reason, DynamoDB uses

				    // the content type "application/x-amz-json-1.0" instead of the standard

				    // "application/json". Some other AWS services use later versions instead

				    // of "1.0", but DynamoDB currently uses "1.0". Note that this content

				    // type applies to all replies, both success and error.

				    static constexpr const char* REPLY_CONTENT_TYPE = "application/x-amz-json-1.0";

				public:

				    api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle) : _f_handle(

				         [this, _handle](std::unique_ptr<request> req, std::unique_ptr<reply> rep) {

				@@ -124,22 +134,19 @@ public:

				             }

				             auto res = resf.get();

				             std::visit(overloaded_functor {

				                 [&] (const json::json_return_type& json_return_value) {

				                     slogger.trace("api_handler success case");

				                     if (json_return_value._body_writer) {

				                         // Unfortunately, write_body() forces us to choose

				                         // from a fixed and irrelevant list of "mime-types"

				                         // at this point. But we'll override it with the

				                         // one (application/x-amz-json-1.0) below.

				                         rep->write_body("json", std::move(json_return_value._body_writer));

				                     } else {

				                         rep->_content += json_return_value._res;

				                     }

				                 },

				                 [&] (const api_error& err) {

				                     generate_error_reply(*rep, err);

				                 }

				             }, res);

				                [&] (std::string&& str) {

				                    // Note that despite the move, there is a copy here -

				                    // as str is std::string and rep->_content is sstring.

				                    rep->_content = std::move(str);

				                    rep->set_content_type(REPLY_CONTENT_TYPE);

				                },

				                [&] (executor::body_writer&& body_writer) {

				                    rep->write_body(REPLY_CONTENT_TYPE, std::move(body_writer));

				                },

				                [&] (const api_error& err) {

				                    generate_error_reply(*rep, err);

				                }

				             }, std::move(res));

				             return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				         });

				@@ -151,7 +158,6 @@ public:

				        handle_CORS(*req, *rep, false);

				        return _f_handle(std::move(req), std::move(rep)).then(

				                [](std::unique_ptr<reply> rep) {

				                    rep->set_mime_type("application/x-amz-json-1.0");

				                    rep->done();

				                    return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				                });

				@@ -167,6 +173,7 @@ protected:

				        rjson::add(results, "message", err._msg);

				        rep._content = rjson::print(std::move(results));

				        rep._status = err._http_code;

				        rep.set_content_type(REPLY_CONTENT_TYPE);

				        slogger.trace("api_handler error case: {}", rep._content);

				    }

				@@ -217,7 +224,7 @@ protected:

				        // If the DC does not exist, we return an empty list - not an error.

				        sstring query_dc = req->get_query_param("dc");

				        sstring local_dc = query_dc.empty() ? topology.get_datacenter() : query_dc;

				        std::unordered_set<gms::inet_address> local_dc_nodes;

				        std::unordered_set<locator::host_id> local_dc_nodes;

				        const auto& endpoints = topology.get_datacenter_endpoints();

				        auto dc_it = endpoints.find(local_dc);

				        if (dc_it != endpoints.end()) {

				@@ -227,9 +234,9 @@ protected:

				        // DC, unless a single rack is selected by the "rack" query option.

				        // If the rack does not exist, we return an empty list - not an error.

				        sstring query_rack = req->get_query_param("rack");

				        for (auto& ip : local_dc_nodes) {

				        for (auto& id : local_dc_nodes) {

				            if (!query_rack.empty()) {

				                auto rack = _gossiper.get_application_state_value(ip, gms::application_state::RACK);

				                auto rack = _gossiper.get_application_state_value(id, gms::application_state::RACK);

				                if (rack != query_rack) {

				                    continue;

				                }

				@@ -237,10 +244,10 @@ protected:

				            // Note that it's not enough for the node to be is_alive() - a

				            // node joining the cluster is also "alive" but not responsive to

				            // requests. We need alive *and* normal. See #19694, #21538.

				            if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {

				            if (_gossiper.is_alive(id) && _gossiper.is_normal(id)) {

				                // Use the gossiped broadcast_rpc_address if available instead

				                // of the internal IP address "ip". See discussion in #18711.

				                rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));

				                rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(id)));

				            }

				        }

				        rep->set_status(reply::status_type::ok);

				@@ -266,24 +273,57 @@ protected:

				    }

				};

				// This function increments the authentication_failures counter, and may also

				// log a warn-level message and/or throw an exception, depending on what

				// enforce_authorization and warn_authorization are set to.

				// The username and client address are only used for logging purposes -

				// they are not included in the error message returned to the client, since

				// the client knows who it is.

				// Note that if enforce_authorization is false, this function will return

				// without throwing. So a caller that doesn't want to continue after an

				// authentication_error must explicitly return after calling this function.

				template<typename Exception>

				static void authentication_error(alternator::stats& stats, bool enforce_authorization, bool warn_authorization, Exception&& e, std::string_view user, gms::inet_address client_address) {

				    stats.authentication_failures++;

				    if (enforce_authorization) {

				        if (warn_authorization) {

				            slogger.warn("alternator_warn_authorization=true: {} for user {}, client address {}", e.what(), user, client_address);

				        }

				        throw std::move(e);

				    } else {

				        if (warn_authorization) {

				            slogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {} for user {}, client address {}", e.what(), user, client_address);

				        }

				    }

				}

				future<std::string> server::verify_signature(const request& req, const chunked_content& content) {

				    if (!_enforce_authorization) {

				    if (!_enforce_authorization.get() && !_warn_authorization.get()) {

				        slogger.debug("Skipping authorization");

				        return make_ready_future<std::string>();

				    }

				    auto host_it = req._headers.find("Host");

				    if (host_it == req._headers.end()) {

				        throw api_error::invalid_signature("Host header is mandatory for signature verification");

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::invalid_signature("Host header is mandatory for signature verification"), 

				            "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    auto authorization_it = req._headers.find("Authorization");

				    if (authorization_it == req._headers.end()) {

				        throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::missing_authentication_token("Authorization header is mandatory for signature verification"),

				            "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    std::string host = host_it->second;

				    std::string_view authorization_header = authorization_it->second;

				    auto pos = authorization_header.find_first_of(' ');

				    if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {

				        throw api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header)),

				            "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    authorization_header.remove_prefix(pos+1);

				    std::string credential;

				@@ -318,7 +358,9 @@ future<std::string> server::verify_signature(const request& req, const chunked_c

				    std::vector<std::string_view> credential_split = split(credential, '/');

				    if (credential_split.size() != 5) {

				        throw api_error::validation(fmt::format("Incorrect credential information format: {}", credential));

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::validation(fmt::format("Incorrect credential information format: {}", credential)), "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    std::string user(credential_split[0]);

				    std::string datestamp(credential_split[1]);

				@@ -342,7 +384,7 @@ future<std::string> server::verify_signature(const request& req, const chunked_c

				    auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {

				        return get_key_from_roles(proxy, as, std::move(username));

				    };

				    return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,

				    return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,

				                                                    user = std::move(user),

				                                                    host = std::move(host),

				                                                    datestamp = std::move(datestamp),

				@@ -350,18 +392,32 @@ future<std::string> server::verify_signature(const request& req, const chunked_c

				                                                    signed_headers_map = std::move(signed_headers_map),

				                                                    region = std::move(region),

				                                                    service = std::move(service),

				                                                    user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {

				                                                    user_signature = std::move(user_signature)] (future<key_cache::value_ptr> key_ptr_fut) {

				        key_cache::value_ptr key_ptr(nullptr);

				        try {

				            key_ptr = key_ptr_fut.get();

				        } catch (const api_error& e) {

				            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				                e, user, req.get_client_address());

				            return std::string();

				        }

				        std::string signature;

				        try {

				            signature = utils::aws::get_signature(user, *key_ptr, std::string_view(host), "/", req._method,

				                datestamp, signed_headers_str, signed_headers_map, &content, region, service, "");

				        } catch (const std::exception& e) {

				            throw api_error::invalid_signature(e.what());

				            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				                api_error::invalid_signature(fmt::format("invalid signature: {}", e.what())),

				                user, req.get_client_address());

				            return std::string();

				        }

				        if (signature != std::string_view(user_signature)) {

				            _key_cache.remove(user);

				            throw api_error::unrecognized_client("The security token included in the request is invalid.");

				            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				                api_error::unrecognized_client("wrong signature"),

				                user, req.get_client_address());

				            return std::string();

				        }

				        return user;

				    });

				@@ -374,35 +430,82 @@ static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing

				    return tracing_instance.create_session(tracing::trace_type::QUERY, props);

				}

				// truncated_content_view() prints a potentially long chunked_content for

				// debugging purposes. In the common case when the content is not excessively

				// long, it just returns a view into the given content, without any copying.

				// But when the content is very long, it is truncated after some arbitrary

				// max_len (or one chunk, whichever comes first), with "<truncated>" added at

				// the end. To do this modification to the string, we need to create a new

				// std::string, so the caller must pass us a reference to one, "buf", where

				// we can store the content. The returned view is only alive for as long this

				// buf is kept alive.

				static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {

				    constexpr size_t max_len = 1024;

				    if (content.empty()) {

				        return std::string_view();

				    } else if (content.size() == 1 && content.begin()->size() <= max_len) {

				        return std::string_view(content.begin()->get(), content.begin()->size());

				    } else {

				        buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";

				        return std::string_view(buf);

				// A helper class to represent a potentially truncated view of a chunked_content.

				// If the content is short enough and single chunked, it just holds a view into the content.

				// Otherwise it will be copied into an internal buffer, possibly truncated (depending on maximum allowed size passed in),

				// and the view will point into that buffer.

				// `as_view()` method will return the view.

				// `take_as_sstring()` will either move out the internal buffer (if any), or create a new sstring from the view.

				// You should consider `as_view()` valid as long as both the original chunked_content and the truncated_content object are alive.

				class truncated_content {

				    std::string_view _view;

				    sstring _content_maybe;

				    void copy_from_content(const chunked_content& content) {

				        size_t offset = 0;

				        for(auto &tmp : content) {

				            size_t to_copy = std::min(tmp.size(), _content_maybe.size() - offset);

				            std::copy(tmp.get(), tmp.get() + to_copy, _content_maybe.data() + offset);

				            offset += to_copy;

				            if (offset >= _content_maybe.size()) {

				                break;

				            }

				        }

				    }

				public:

				    truncated_content(const chunked_content& content, size_t max_len = std::numeric_limits<size_t>::max()) {

				        if (content.empty()) return;

				        if (content.size() == 1 && content.begin()->size() <= max_len) {

				            _view = std::string_view(content.begin()->get(), content.begin()->size());

				            return;

				        }

				        constexpr std::string_view truncated_text = "<truncated>";

				        size_t content_size = 0;

				        for(auto &tmp : content) {

				            content_size += tmp.size();

				        }

				        if (content_size <= max_len) {

				            _content_maybe = sstring{ sstring::initialized_later{}, content_size };

				            copy_from_content(content);

				        }

				        else {

				            _content_maybe = sstring{ sstring::initialized_later{}, max_len + truncated_text.size() };

				            copy_from_content(content);

				            std::copy(truncated_text.begin(), truncated_text.end(), _content_maybe.data() + _content_maybe.size() - truncated_text.size());

				        }

				        _view = std::string_view(_content_maybe);

				    }

				    std::string_view as_view() const { return _view; }

				    sstring take_as_sstring() && {

				        if (_content_maybe.empty() && !_view.empty()) {

				            return sstring{_view};

				        }

				        return std::move(_content_maybe);

				    }

				};

				// `truncated_content_view` will produce an object representing a view to a passed content

				// possibly truncated at some length. The value returned is used in two ways:

				// - to print it in logs (use `as_view()` method for this)

				// - to pass it to tracing object, where it will be stored and used later

				//   (use `take_as_sstring()` method as this produces a copy in form of a sstring)

				// `truncated_content` delays constructing `sstring` object until it's actually needed.

				// `truncated_content` is valid as long as passed `content` is alive.

				// if the content is truncated, `<truncated>` will be appended at the maximum size limit

				// and total size will be `max_users_query_size_in_trace_output() + strlen("<truncated>")`.

				static truncated_content truncated_content_view(const chunked_content& content, size_t max_size) {

				    return truncated_content{content, max_size};

				}
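The truncation rule described in the comments above can be illustrated outside Seastar. The following is only a minimal sketch, assuming a hypothetical `chunks` alias (a `std::vector<std::string>`) in place of `chunked_content` and a hypothetical `truncate_chunks()` helper; it always copies, so it does not reproduce the zero-copy single-chunk fast path of `truncated_content`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for chunked_content: a list of byte chunks.
using chunks = std::vector<std::string>;

// Flattens `content` into one string, keeping at most `max_len` bytes of
// payload and appending "<truncated>" only when something was cut off.
std::string truncate_chunks(const chunks& content, std::size_t max_len) {
    constexpr std::string_view marker = "<truncated>";
    std::size_t total = 0;
    for (const auto& c : content) {
        total += c.size();
    }
    std::string out;
    out.reserve(std::min(total, max_len) + marker.size());
    for (const auto& c : content) {
        // out.size() never exceeds max_len, so this cannot underflow.
        std::size_t room = max_len - out.size();
        out.append(c, 0, std::min(c.size(), room));
        if (out.size() >= max_len) {
            break;
        }
    }
    if (total > max_len) {
        out.append(marker);
    }
    return out;
}
```

As in the class above, content whose total size is exactly `max_len` is returned whole, and a truncated result has size `max_len + strlen("<truncated>")`.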

				static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, std::string_view op, const chunked_content& query) {

				static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, std::string_view op, const chunked_content& query, size_t max_users_query_size_in_trace_output) {

				    tracing::trace_state_ptr trace_state;

				    tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();

				    if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {

				        trace_state = create_tracing_session(tracing_instance);

				        std::string buf;

				        tracing::add_session_param(trace_state, "alternator_op", op);

				        tracing::add_query(trace_state, truncated_content_view(query, buf));

				        tracing::add_query(trace_state, truncated_content_view(query, max_users_query_size_in_trace_output).take_as_sstring());

				        tracing::begin(trace_state, seastar::format("Alternator {}", op), client_state.get_client_address());

				        if (!username.empty()) {

				            tracing::set_username(trace_state, auth::authenticated_user(username));

				@@ -411,30 +514,207 @@ static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_

				    return trace_state;

				}

				// This read_entire_stream() is similar to Seastar's read_entire_stream()

				// which reads the given content_stream until its end into non-contiguous

				// memory. The difference is that this implementation takes an extra length

				// limit, and throws an error if we read more than this limit.

				// This length-limited variant would not have been needed if Seastar's HTTP

				// server's set_content_length_limit() worked in every case, but unfortunately

				// it does not - it only works if the request has a Content-Length header (see

				// issue #8196). In contrast this function can limit the request's length no

				// matter how it's encoded. We need this limit to protect Alternator from

				// oversized requests that can deplete memory.

				static future<chunked_content>

				read_entire_stream(input_stream<char>& inp, size_t length_limit) {

				    chunked_content ret;

				    // We try to read length_limit + 1 bytes, so that we can throw an

				    // exception if we managed to read more than length_limit.

				    ssize_t remain = length_limit + 1;

				    do {

				        temporary_buffer<char> buf = co_await inp.read_up_to(remain);

				        if (buf.empty()) {

				            break;

				        }

				        remain -= buf.size();

				        ret.push_back(std::move(buf));

				    } while (remain > 0);

				    // If we read the full length_limit + 1 bytes, we went over the limit:

				    if (remain <= 0) {

				        // By throwing an error here, we may send a reply (the error message)

				        // without having read the full request body. Seastar's httpd will

				        // realize that we have not read the entire content stream, and

				        // correctly mark the connection unreusable, i.e., close it.

				        // This means we are currently exposed to issue #12166 (caused by

				        // Seastar issue 1325), where the client may get an RST instead of

				        // a FIN, and may rarely get a "Connection reset by peer" before

				        // reading the error we send.

				        throw api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", length_limit));

				    }

				    co_return ret;

				}
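The "read one byte past the limit" technique used by `read_entire_stream()` does not depend on Seastar. Below is a minimal synchronous sketch, assuming a plain `std::istream` in place of `input_stream<char>`, a hypothetical `read_limited()` helper, and `std::length_error` in place of `api_error::payload_too_large`.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Reads `in` to its end in chunks of at most `chunk_size` bytes and throws
// if the stream holds more than `length_limit` bytes.
std::vector<std::string> read_limited(std::istream& in, std::size_t length_limit,
                                      std::size_t chunk_size = 4096) {
    std::vector<std::string> ret;
    // Ask for one byte more than the limit so that an overflow is observable.
    std::size_t remain = length_limit + 1;
    while (remain > 0) {
        std::string buf(std::min(remain, chunk_size), '\0');
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        buf.resize(static_cast<std::size_t>(in.gcount()));
        if (buf.empty()) {
            break; // end of stream
        }
        remain -= buf.size();
        ret.push_back(std::move(buf));
    }
    // Consuming all length_limit + 1 bytes means the body was too long.
    if (remain == 0) {
        throw std::length_error("request content length limit exceeded");
    }
    return ret;
}
```

Because the check is made on bytes actually read, it works whether or not the sender declared a length up front, which is the property the comment above relies on.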

				// safe_gzip_zstream is an exception-safe wrapper for zlib's z_stream.

				// The "z_stream" struct is used by zlib to hold state while decompressing a

				// stream of data. It allocates memory which must be freed with inflateEnd(),

				// which the destructor of this class does.

				class safe_gzip_zstream {

				    z_stream _zs;

				public:

				    safe_gzip_zstream() {

				        memset(&_zs, 0, sizeof(_zs));

				        // The strange 16 + MAX_WBITS tells zlib to expect and decode

				        // a gzip header, not a zlib header.

				        if (inflateInit2(&_zs, 16 + MAX_WBITS) != Z_OK) {

				            // Should only happen if memory allocation fails

				            throw std::bad_alloc();

				        }

				    }

				    ~safe_gzip_zstream() {

				        inflateEnd(&_zs);

				    }

				    z_stream* operator->() {

				        return &_zs;

				    }

				    z_stream* get() {

				        return &_zs;

				    }

				    void reset() {

				        inflateReset(&_zs);

				    }

				};

				// ungzip() takes a chunked_content with a gzip-compressed request body,

				// uncompresses it, and returns the uncompressed content as a chunked_content.

				// If the uncompressed content exceeds length_limit, an error is thrown.

				static future<chunked_content>

				ungzip(chunked_content&& compressed_body, size_t length_limit) {

				    chunked_content ret;

				    // output_buf can be any size - when uncompressing input_buf, it doesn't

				    // need to fit in a single output_buf, we'll use multiple output_buf for

				    // a single input_buf if needed.

				    constexpr size_t OUTPUT_BUF_SIZE = 4096;

				    temporary_buffer<char> output_buf;

				    safe_gzip_zstream strm;

				    bool complete_stream = false; // empty input is not a valid gzip

				    size_t total_out_bytes = 0;

				    for (const temporary_buffer<char>& input_buf : compressed_body) {

				        if (input_buf.empty()) {

				            continue;

				        }

				        complete_stream = false;

				        strm->next_in = (Bytef*) input_buf.get();

				        strm->avail_in = (uInt) input_buf.size();

				        do {

				            co_await coroutine::maybe_yield();

				            if (output_buf.empty()) {

				                output_buf = temporary_buffer<char>(OUTPUT_BUF_SIZE);

				            }

				            strm->next_out = (Bytef*) output_buf.get();

				            strm->avail_out = OUTPUT_BUF_SIZE;

				            int e = inflate(strm.get(), Z_NO_FLUSH);

				            size_t out_bytes = OUTPUT_BUF_SIZE - strm->avail_out;

				            if (out_bytes > 0) {

				                // If output_buf is nearly full, we save it as-is in ret. But

				                // if it only has little data, better copy to a small buffer.

				                if (out_bytes > OUTPUT_BUF_SIZE/2) {

				                    ret.push_back(std::move(output_buf).prefix(out_bytes));

				                    // output_buf is now empty. if this loop finds more input,

				                    // we'll allocate a new output buffer.

				                } else {

				                    ret.push_back(temporary_buffer<char>(output_buf.get(), out_bytes));

				                }

				                total_out_bytes += out_bytes;

				                if (total_out_bytes > length_limit) {

				                    throw api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", length_limit));

				                }

				            }

				            if (e == Z_STREAM_END) {

				                // There may be more input after the first gzip stream - in

				                // either this input_buf or the next one. The additional input

				                // should be a second concatenated gzip. We need to allow that

				                // by resetting the gzip stream and continuing the input loop

				                // until there's no more input.

				                strm.reset();

				                if (strm->avail_in == 0) {

				                    complete_stream = true;

				                    break;

				                }

				            } else if (e != Z_OK && e != Z_BUF_ERROR) {

				                // DynamoDB returns an InternalServerError when given a bad

				                // gzip request body. See test test_broken_gzip_content

				                throw api_error::internal("Error during gzip decompression of request body");

				            }

				        } while (strm->avail_in > 0 || strm->avail_out == 0);

				    }

				    if (!complete_stream) {

				        // The gzip stream was not properly finished with Z_STREAM_END

				        throw api_error::internal("Truncated gzip in request body");

				    }

				    co_return ret;

				}

				future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {

				    _executor._stats.total_operations++;

				    sstring target = req->get_header("X-Amz-Target");

				    // target is DynamoDB API version followed by a dot '.' and operation type (e.g. CreateTable)

				    auto dot = target.find('.');

				    std::string_view op = (dot == sstring::npos) ? std::string_view() : std::string_view(target).substr(dot+1);

				    if (req->content_length > request_content_length_limit) {

				        // If we have a Content-Length header and know the request will be too

				        // long, we don't need to wait for read_entire_stream() below to

				        // discover it. And we definitely mustn't try to get_units() below for

				        // such a size.

				        co_return api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", request_content_length_limit));

				    }

				    // JSON parsing can allocate up to roughly 2x the size of the raw

				    // document, + a couple of bytes for maintenance.

				    // TODO: consider the case where req->content_length is missing. Maybe

				    // we need to take the content_length_limit and return some of the units

				    // when we finish read_content_and_verify_signature?

				    size_t mem_estimate = req->content_length * 2 + 8000;

				    // If the Content-Length of the request is not available, we assume

				    // the largest possible request (request_content_length_limit, i.e., 16 MB)

				    // and after reading the request we return_units() the excess.

				    size_t mem_estimate = (req->content_length ? req->content_length : request_content_length_limit) * 2 + 8000;

				    auto units_fut = get_units(*_memory_limiter, mem_estimate);

				    if (_memory_limiter->waiters()) {

				        ++_executor._stats.requests_blocked_memory;

				    }

				    auto units = co_await std::move(units_fut);

				    SCYLLA_ASSERT(req->content_stream);

				    chunked_content content = co_await util::read_entire_stream(*req->content_stream);

				    chunked_content content = co_await read_entire_stream(*req->content_stream, request_content_length_limit);

				    // If the request had no Content-Length, we reserved too many units

				    // so need to return some

				    if (req->content_length == 0) {

				        size_t content_length = 0;

				        for (const auto& chunk : content) {

				            content_length += chunk.size();

				        }

				        size_t new_mem_estimate = content_length * 2 + 8000;

				        units.return_units(mem_estimate - new_mem_estimate);

				    }

				    auto username = co_await verify_signature(*req, content);

				    // If the request is compressed, uncompress it now, after we checked

				    // the signature (the signature is computed on the compressed content).

				    // We apply the request_content_length_limit again to the uncompressed

				    // content - we don't want to allow a tiny compressed request to

				    // expand to a huge uncompressed request.

				    sstring content_encoding = req->get_header("Content-Encoding");

				    if (content_encoding == "gzip") {

				        content = co_await ungzip(std::move(content), request_content_length_limit);

				    } else if (!content_encoding.empty()) {

				        // DynamoDB returns a 500 error for unsupported Content-Encoding.

				        // I'm not sure if this is the best error code, but let's do it too.

				        // See the test test_garbage_content_encoding confirming this case.

				        co_return api_error::internal("Unsupported Content-Encoding");

				    }

				    // As long as the system_clients_entry object is alive, this request will

				    // be visible in the "system.clients" virtual table. When requested, this

				    // entry will be formatted by server::ongoing_request::make_client_data().

				    auto system_clients_entry = _ongoing_requests.emplace(

				        req->get_client_address(), req->get_header("User-Agent"),

				        username, current_scheduling_group(),

				        req->get_protocol_name() == "https");

				    if (slogger.is_enabled(log_level::trace)) {

				        std::string buf;

				        slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);

				        slogger.trace("Request: {} {} {}", op, truncated_content_view(content, _max_users_query_size_in_trace_output).as_view(), req->_headers);

				    }

				    auto callback_it = _callbacks.find(op);

				    if (callback_it == _callbacks.end()) {

				@@ -454,11 +734,21 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr

				    }

				    co_await client_state.maybe_update_per_service_level_params();

				    tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);

				    tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content, _max_users_query_size_in_trace_output.get());

				    tracing::trace(trace_state, "{}", op);

				    rjson::value json_request = co_await _json_parser.parse(std::move(content));

				    co_return co_await callback_it->second(_executor, client_state, trace_state,

				            make_service_permit(std::move(units)), std::move(json_request), std::move(req));

				    auto user = client_state.user();

				    auto f = [this, content = std::move(content), &callback = callback_it->second,

				            client_state = std::move(client_state), trace_state = std::move(trace_state),

				            units = std::move(units), req = std::move(req)] () mutable -> future<executor::request_return_type> {

				                rjson::value json_request = co_await _json_parser.parse(std::move(content));

				                if (!json_request.IsObject()) {

				                    co_return api_error::validation("Request content must be an object");

				                }

				                co_return co_await callback(_executor, client_state, trace_state,

				                    make_service_permit(std::move(units)), std::move(json_request), std::move(req));

				    };

				    co_return co_await _sl_controller.with_user_service_level(user, std::ref(f));

				}
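The memory reservation arithmetic in `handle_api_request()` can be checked in isolation. This is a minimal sketch, assuming hypothetical helper names `estimate_memory()` and `excess_units()`; the constants (16 MB limit, factor 2, 8000 bytes of slack) are taken from the code above, and `actual` is assumed not to exceed the limit, which `read_entire_stream()` enforces.

```cpp
#include <cstddef>

// Mirrors request_content_length_limit (16 MB) from server.hh.
constexpr std::size_t request_limit = 16 * 1024 * 1024;

// Units to reserve before reading the body. A missing Content-Length
// (reported as 0) is treated as the largest possible request.
constexpr std::size_t estimate_memory(std::size_t content_length) {
    return (content_length ? content_length : request_limit) * 2 + 8000;
}

// Units to hand back once the real body size is known. Only requests
// without a Content-Length were over-reserved.
constexpr std::size_t excess_units(std::size_t declared, std::size_t actual) {
    if (declared != 0) {
        return 0;
    }
    return estimate_memory(0) - (actual * 2 + 8000);
}
```

For example, a 100-byte body sent without a Content-Length first reserves 33562432 units and then returns all but 8200 of them.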

				void server::set_routes(routes& r) {

				@@ -495,9 +785,9 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos

				        , _auth_service(auth_service)

				        , _sl_controller(sl_controller)

				        , _key_cache(1024, 1min, slogger)

				        , _enforce_authorization(false)

				        , _max_users_query_size_in_trace_output(1024)

				        , _enabled_servers{}

				        , _pending_requests{}

				        , _pending_requests("alternator::server::pending_requests")

				        , _timeout_config(_proxy.data_dictionary().get_config())

				      , _callbacks{

				        {"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {

				@@ -576,10 +866,13 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos

				}

				future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,

				        utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {

				        utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,

				        semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {

				    _memory_limiter = memory_limiter;

				    _enforce_authorization = std::move(enforce_authorization);

				    _warn_authorization = std::move(warn_authorization);

				    _max_concurrent_requests = std::move(max_concurrent_requests);

				    _max_users_query_size_in_trace_output = std::move(max_users_query_size_in_trace_output);

				    if (!port && !https_port) {

				        return make_exception_future<>(std::runtime_error("Either regular port or TLS port"

				                " must be specified in order to init an alternator HTTP server instance"));

				@@ -589,23 +882,31 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:

				        if (port) {

				            set_routes(_http_server._routes);

				            _http_server.set_content_length_limit(server::content_length_limit);

				            _http_server.set_content_streaming(true);

				            _http_server.listen(socket_address{addr, *port}).get();

				            _enabled_servers.push_back(std::ref(_http_server));

				        }

				        if (https_port) {

				            set_routes(_https_server._routes);

				            _https_server.set_content_length_limit(server::content_length_limit);

				            _https_server.set_content_streaming(true);

				            auto server_creds = creds->build_reloadable_server_credentials([](const std::unordered_set<sstring>& files, std::exception_ptr ep) {

				                if (ep) {

				                    slogger.warn("Exception loading {}: {}", files, ep);

				                } else {

				                    slogger.info("Reloaded {}", files);

				                }

				            }).get();

				            _https_server.listen(socket_address{addr, *https_port}, std::move(server_creds)).get();

				            if (this_shard_id() == 0) {

				                _credentials = creds->build_reloadable_server_credentials([this](const tls::credentials_builder& b, const std::unordered_set<sstring>& files, std::exception_ptr ep) -> future<> {

				                    if (ep) {

				                        slogger.warn("Exception loading {}: {}", files, ep);

				                    } else {

				                        co_await container().invoke_on_others([&b](server& s) {

				                            if (s._credentials) {

				                                b.rebuild(*s._credentials);

				                            }

				                        });

				                        slogger.info("Reloaded {}", files);

				                    }

				                }).get();

				            } else {

				                _credentials = creds->build_server_credentials();

				            }

				            _https_server.listen(socket_address{addr, *https_port}, _credentials).get();

				            _enabled_servers.push_back(std::ref(_https_server));

				        }

				    });

				@@ -661,6 +962,37 @@ future<> server::json_parser::stop() {

				    return std::move(_run_parse_json_thread);

				}

				// Convert an entry in the server's list of ongoing Alternator requests

				// (_ongoing_requests) into a client_data object. This client_data object

				// will then be used to produce a row for the "system.clients" virtual table.

				client_data server::ongoing_request::make_client_data() const {

				    client_data cd;

				    cd.ct = client_type::alternator;

				    cd.ip = _client_address.addr();

				    cd.port = _client_address.port();

				    cd.shard_id = this_shard_id();

				    cd.connection_stage = client_connection_stage::established;

				    cd.username = _username;

				    cd.scheduling_group_name = _scheduling_group.name();

				    cd.ssl_enabled = _is_https;

				    // For now, we save the full User-Agent header as the "driver name"

				    // and keep "driver_version" unset.

				    cd.driver_name = _user_agent;

				    // Leave "protocol_version" unset, it has no meaning in Alternator.

				    // Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset.

				    // As reported in issue #9216, we never set these fields in CQL

				    // either (see cql_server::connection::make_client_data()).

				    return cd;

				}

				future<utils::chunked_vector<client_data>> server::get_client_data() {

				    utils::chunked_vector<client_data> ret;

				    co_await _ongoing_requests.for_each_gently([&ret] (const ongoing_request& r) {

				        ret.emplace_back(r.make_client_data());

				    });

				    co_return ret;

				}

				const char* api_error::what() const noexcept {

				    if (_what_string.empty()) {

				        _what_string = fmt::format("{} {}: {}", std::to_underlying(_http_code), _type, _msg);

									
										38

alternator/server.hh
									
				@@ -9,6 +9,7 @@

				#pragma once

				#include "alternator/executor.hh"

				#include "utils/scoped_item_list.hh"

				#include <seastar/core/future.hh>

				#include <seastar/core/condition-variable.hh>

				#include <seastar/http/httpd.hh>

				@@ -20,12 +21,18 @@

				#include "utils/updateable_value.hh"

				#include <seastar/core/units.hh>

				struct client_data;

				namespace alternator {

				using chunked_content = rjson::chunked_content;

				class server {

				    static constexpr size_t content_length_limit = 16*MB;

				class server : public peering_sharded_service<server> {

				    // The maximum size of a request body that Alternator will accept,

				    // in bytes. This is a safety measure to prevent Alternator from

				    // running out of memory when a client sends a very large request.

				    // DynamoDB also has the same limit set to 16 MB.

				    static constexpr size_t request_content_length_limit = 16*MB;

				    using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,

				            tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>)>;

				    using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;

				@@ -40,8 +47,10 @@ class server {

				    key_cache _key_cache;

				    utils::updateable_value<bool> _enforce_authorization;

				    utils::updateable_value<bool> _warn_authorization;

				    utils::updateable_value<uint64_t> _max_users_query_size_in_trace_output;

				    utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;

				    gate _pending_requests;

				    named_gate _pending_requests;

				    // In some places we will need a CQL updateable_timeout_config object even

				    // though it isn't really relevant for Alternator which defines its own

				    // timeouts separately. We can create this object only once.

				@@ -52,6 +61,8 @@ class server {

				    semaphore* _memory_limiter;

				    utils::updateable_value<uint32_t> _max_concurrent_requests;

				    ::shared_ptr<seastar::tls::server_credentials> _credentials;

				    class json_parser {

				        static constexpr size_t yieldable_parsing_threshold = 16*KB;

				        chunked_content _raw_document;

				@@ -72,12 +83,31 @@ class server {

				    };

				    json_parser _json_parser;

				    // The server maintains a list of ongoing requests, that are being handled

				    // by handle_api_request(). It uses this list in get_client_data(), which

				    // is called when reading the "system.clients" virtual table.

				    struct ongoing_request {

				        socket_address _client_address;

				        sstring _user_agent;

				        sstring _username;

				        scheduling_group _scheduling_group;

				        bool _is_https;

				        client_data make_client_data() const;

				    };

				    utils::scoped_item_list<ongoing_request> _ongoing_requests;

				public:

				    server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);

				    future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,

				            utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);

				            utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,

				            semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);

				    future<> stop();

				    // get_client_data() is called (on each shard separately) when the virtual

				    // table "system.clients" is read. It is expected to generate a list of

				    // clients connected to this server (on this shard). This function is

				    // called by alternator::controller::get_client_data().

				    future<utils::chunked_vector<client_data>> get_client_data();

				private:

				    void set_routes(seastar::httpd::routes& r);

				    // If verification succeeds, returns the authenticated user's username

									
										197

alternator/stats.cc
									
												
				@@ -9,32 +9,63 @@

				#include "stats.hh"

				#include "utils/histogram_metrics_helper.hh"

				#include <seastar/core/metrics.hh>

				#include "utils/labels.hh"

				namespace alternator {

				const char* ALTERNATOR_METRICS = "alternator";

				static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {

				    seastar::metrics::histogram res;

				    res.buckets.resize(histogram.bucket_offsets.size());

				    uint64_t cumulative_count = 0;

				    res.sample_count = histogram._count;

				    res.sample_sum = histogram._sample_sum;

				    for (size_t i = 0; i < res.buckets.size(); i++) {

				        auto& v = res.buckets[i];

				        v.upper_bound = histogram.bucket_offsets[i];

				        cumulative_count += histogram.buckets[i];

				        v.count = cumulative_count;

				    }

				    return res;

				}

				static seastar::metrics::label column_family_label("cf");

				static seastar::metrics::label keyspace_label("ks");

				static void register_metrics_with_optional_table(seastar::metrics::metric_groups& metrics, const stats& stats, const sstring& ks, const sstring& table) {

				stats::stats() : api_operations{} {

				    // Register the

				    seastar::metrics::label op("op");

				    _metrics.add_group("alternator", {

				    bool has_table = table.length();

				    std::vector<seastar::metrics::label> aggregate_labels;

				    std::vector<seastar::metrics::label_instance> labels = {alternator_label};

				    sstring group_name = (has_table)? "alternator_table" : "alternator";

				    if (has_table) {

				        labels.push_back(column_family_label(table));

				        labels.push_back(keyspace_label(ks));

				        aggregate_labels.push_back(seastar::metrics::shard_label);

				    }

				    metrics.add_group(group_name, {

				#define OPERATION(name, CamelCaseName) \

				                seastar::metrics::make_total_operations("operation", api_operations.name, \

				                        seastar::metrics::description("number of operations via Alternator API"), {op(CamelCaseName)}).set_skip_when_empty(),

				                seastar::metrics::make_total_operations("operation", stats.api_operations.name, \

				                        seastar::metrics::description("number of operations via Alternator API"), labels)(basic_level)(op(CamelCaseName)).aggregate(aggregate_labels).set_skip_when_empty(),

				#define OPERATION_LATENCY(name, CamelCaseName) \

						metrics.add_group(group_name, { \

				                seastar::metrics::make_histogram("op_latency", \

				                        seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName)}, [this]{return to_metrics_histogram(api_operations.name.histogram());}).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(), \

				                        seastar::metrics::description("Latency histogram of an operation via Alternator API"), labels, [&stats]{return to_metrics_histogram(stats.api_operations.name.histogram());})(op(CamelCaseName))(basic_level).aggregate({seastar::metrics::shard_label}).set_skip_when_empty()}); \

				            if (!has_table) {\

				            	metrics.add_group("alternator", { \

								seastar::metrics::make_summary("op_latency_summary", \

										                        seastar::metrics::description("Latency summary of an operation via Alternator API"), [this]{return to_metrics_summary(api_operations.name.summary());})(op(CamelCaseName)).set_skip_when_empty(),

										                        seastar::metrics::description("Latency summary of an operation via Alternator API"), [&stats]{return to_metrics_summary(stats.api_operations.name.summary());})(op(CamelCaseName))(basic_level)(alternator_label).set_skip_when_empty()}); \

				            }

				            OPERATION(batch_get_item, "BatchGetItem")

				            OPERATION(batch_write_item, "BatchWriteItem")

				            OPERATION(create_backup, "CreateBackup")

				            OPERATION(create_global_table, "CreateGlobalTable")

				            OPERATION(create_table, "CreateTable")

				            OPERATION(delete_backup, "DeleteBackup")

				            OPERATION(delete_item, "DeleteItem")

				            OPERATION(delete_table, "DeleteTable")

				            OPERATION(describe_backup, "DescribeBackup")

				            OPERATION(describe_continuous_backups, "DescribeContinuousBackups")

				            OPERATION(describe_endpoints, "DescribeEndpoints")

				@@ -63,55 +94,117 @@ stats::stats() : api_operations{} {

				            OPERATION(update_item, "UpdateItem")

				            OPERATION(update_table, "UpdateTable")

				            OPERATION(update_time_to_live, "UpdateTimeToLive")

				            OPERATION_LATENCY(put_item_latency, "PutItem")

				            OPERATION_LATENCY(get_item_latency, "GetItem")

				            OPERATION_LATENCY(delete_item_latency, "DeleteItem")

				            OPERATION_LATENCY(update_item_latency, "UpdateItem")

				            OPERATION_LATENCY(batch_write_item_latency, "BatchWriteItem")

				            OPERATION_LATENCY(batch_get_item_latency, "BatchGetItem")

				            OPERATION(list_streams, "ListStreams")

				            OPERATION(describe_stream, "DescribeStream")

				            OPERATION(get_shard_iterator, "GetShardIterator")

				            OPERATION(get_records, "GetRecords")

				            OPERATION_LATENCY(get_records_latency, "GetRecords")

				    });

				    _metrics.add_group("alternator", {

				            seastar::metrics::make_total_operations("unsupported_operations", unsupported_operations,

				                    seastar::metrics::description("number of unsupported operations via Alternator API")),

				            seastar::metrics::make_total_operations("total_operations", total_operations,

				                    seastar::metrics::description("number of total operations via Alternator API")),

				            seastar::metrics::make_total_operations("reads_before_write", reads_before_write,

				                    seastar::metrics::description("number of performed read-before-write operations")),

				            seastar::metrics::make_total_operations("write_using_lwt", write_using_lwt,

				                    seastar::metrics::description("number of writes that used LWT")),

				            seastar::metrics::make_total_operations("shard_bounce_for_lwt", shard_bounce_for_lwt,

				                    seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements")),

				            seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,

				                    seastar::metrics::description("Counts a number of requests blocked due to memory pressure.")),

				            seastar::metrics::make_total_operations("requests_shed", requests_shed,

				                    seastar::metrics::description("Counts a number of requests shed due to overload.")),

				            seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,

				                    seastar::metrics::description("number of rows read during filtering operations")),

				            seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,

				                    seastar::metrics::description("number of rows read and matched during filtering operations")),

				            seastar::metrics::make_counter("rcu_total", rcu_total,

				                    seastar::metrics::description("total number of consumed read units, counted as half units")).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::PUT_ITEM],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("PutItem")}).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::DELETE_ITEM],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("DeleteItem")}).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::UPDATE_ITEM],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("UpdateItem")}).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::INDEX],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("Index")}).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_dropped_total", [this] { return cql_stats.filtered_rows_read_total - cql_stats.filtered_rows_matched_total; },

				                    seastar::metrics::description("number of rows read and dropped during filtering operations")),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},

				                    api_operations.batch_write_item_batch_total).set_skip_when_empty(),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},

				                    api_operations.batch_get_item_batch_total).set_skip_when_empty(),

				    OPERATION_LATENCY(put_item_latency, "PutItem")

				    OPERATION_LATENCY(get_item_latency, "GetItem")

				    OPERATION_LATENCY(delete_item_latency, "DeleteItem")

				    OPERATION_LATENCY(update_item_latency, "UpdateItem")

				    OPERATION_LATENCY(batch_write_item_latency, "BatchWriteItem")

				    OPERATION_LATENCY(batch_get_item_latency, "BatchGetItem")

				    OPERATION_LATENCY(get_records_latency, "GetRecords")

				    if (!has_table) {

				        // Create and delete operations are not applicable to per-table metrics;

				        // register them only for the global metrics.

				        metrics.add_group("alternator", {

				            OPERATION(create_table, "CreateTable")

				            OPERATION(delete_table, "DeleteTable")

				        });

				    }

				    metrics.add_group(group_name, {

				            seastar::metrics::make_total_operations("unsupported_operations", stats.unsupported_operations,

				                    seastar::metrics::description("number of unsupported operations via Alternator API"), labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("total_operations", stats.total_operations,

				                    seastar::metrics::description("number of total operations via Alternator API"), labels)(basic_level).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("reads_before_write", stats.reads_before_write,

				                    seastar::metrics::description("number of performed read-before-write operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("write_using_lwt", stats.write_using_lwt,

				                    seastar::metrics::description("number of writes that used LWT"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("shard_bounce_for_lwt", stats.shard_bounce_for_lwt,

				                    seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("requests_blocked_memory", stats.requests_blocked_memory,

				                    seastar::metrics::description("Counts a number of requests blocked due to memory pressure."), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("requests_shed", stats.requests_shed,

				                    seastar::metrics::description("Counts a number of requests shed due to overload."), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_read_total", stats.cql_stats.filtered_rows_read_total,

				                    seastar::metrics::description("number of rows read during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_matched_total", stats.cql_stats.filtered_rows_matched_total,

				                    seastar::metrics::description("number of rows read and matched during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("rcu_total", [&stats]{return 0.5 * stats.rcu_half_units_total;},

				                    seastar::metrics::description("total number of consumed read units"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::PUT_ITEM],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("PutItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::DELETE_ITEM],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("DeleteItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::UPDATE_ITEM],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("UpdateItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::INDEX],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("Index")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_dropped_total", [&stats] { return stats.cql_stats.filtered_rows_read_total - stats.cql_stats.filtered_rows_matched_total; },

				                    seastar::metrics::description("number of rows read and dropped during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,

				                    stats.api_operations.batch_write_item_batch_total)(op("BatchWriteItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,

				                    stats.api_operations.batch_get_item_batch_total)(op("BatchGetItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				    });

				    seastar::metrics::label expression_label("expression");

				    metrics.add_group(group_name, {

				            seastar::metrics::make_total_operations("expression_cache_evictions", stats.expression_cache.evictions,

				                    seastar::metrics::description("Counts number of entries evicted from expressions cache"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_hits", stats.expression_cache.requests[stats::expression_types::UPDATE_EXPRESSION].hits,

				                    seastar::metrics::description("Counts number of hits of cached expressions"), labels)(expression_label("UpdateExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::UPDATE_EXPRESSION].misses,

				                    seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("UpdateExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_hits", stats.expression_cache.requests[stats::expression_types::CONDITION_EXPRESSION].hits,

				                    seastar::metrics::description("Counts number of hits of cached expressions"), labels)(expression_label("ConditionExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::CONDITION_EXPRESSION].misses,

				                    seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ConditionExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_hits", stats.expression_cache.requests[stats::expression_types::PROJECTION_EXPRESSION].hits,

				                    seastar::metrics::description("Counts number of hits of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::PROJECTION_EXPRESSION].misses,

				                    seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty()

				    });

				    // Only register the following metrics for the global metrics, not per-table

				    if (!has_table) {

				        metrics.add_group("alternator", {

				            seastar::metrics::make_counter("authentication_failures", stats.authentication_failures,

				                seastar::metrics::description("total number of authentication failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_counter("authorization_failures", stats.authorization_failures,

				                seastar::metrics::description("total number of authorization failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				        });

				    }

				}

				void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats) {

				    register_metrics_with_optional_table(metrics, stats, "", "");

				}

				table_stats::table_stats(const sstring& ks, const sstring& table) {

				    _stats = make_lw_shared<stats>();

				    register_metrics_with_optional_table(_metrics, *_stats, ks, table);

				}

				}
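
Note on estimated_histogram_to_metrics() above: it turns per-bucket counts into running (cumulative) counts, the form the metrics histogram buckets carry. The following is a minimal standalone sketch of that conversion only. The types per_bucket_histogram and cumulative_bucket and the function to_cumulative are hypothetical stand-ins, not the Scylla or Seastar types, and the sketch assumes buckets holds at least as many entries as bucket_offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the source histogram: only the two fields
// that the conversion reads are modeled here.
struct per_bucket_histogram {
    std::vector<int64_t> bucket_offsets; // upper bound of each bucket
    std::vector<uint64_t> buckets;       // samples that fell into each bucket
};

// Hypothetical stand-in for one output bucket.
struct cumulative_bucket {
    double upper_bound = 0;
    uint64_t count = 0; // number of samples in this bucket and all lower ones
};

// Convert per-bucket counts into running totals.
// Assumes h.buckets.size() >= h.bucket_offsets.size().
inline std::vector<cumulative_bucket> to_cumulative(const per_bucket_histogram& h) {
    std::vector<cumulative_bucket> res(h.bucket_offsets.size());
    uint64_t running = 0;
    for (size_t i = 0; i < res.size(); i++) {
        res[i].upper_bound = static_cast<double>(h.bucket_offsets[i]);
        running += h.buckets[i];
        res[i].count = running;
    }
    return res;
}
```

With offsets {1, 2, 4} and per-bucket counts {3, 0, 5, 7}, the three output buckets carry counts 3, 3 and 8; the trailing entry (7) is ignored because it has no matching offset.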

									
										75

alternator/stats.hh
									
												
				@@ -12,6 +12,7 @@

				#include <seastar/core/metrics_registration.hh>

				#include "utils/histogram.hh"

				#include "utils/estimated_histogram.hh"

				#include "cql3/stats.hh"

				namespace alternator {

				@@ -21,7 +22,6 @@ namespace alternator {

				// visible by the metrics REST API, with the "alternator" prefix.

				class stats {

				public:

				    stats();

				    // Count of DynamoDB API operations by types

				    struct {

				        uint64_t batch_get_item = 0;

				@@ -75,7 +75,47 @@ public:

				        utils::timed_rate_moving_average_summary_and_histogram batch_write_item_latency;

				        utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;

				        utils::timed_rate_moving_average_summary_and_histogram get_records_latency;

				        utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100

				        utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100

				    } api_operations;

				    // Operation size metrics

				    struct {

				        // Item size statistics collected per table and aggregated per node.

				        // Each histogram covers the range 0 - 446. Resolves #25143.

				        // A size is the retrieved item's size.

				        utils::estimated_histogram get_item_op_size_kb{30};

				        // A size is the maximum of the new item's size and the old item's size.

				        utils::estimated_histogram put_item_op_size_kb{30};

				        // A size is the deleted item's size. If the deleted item's size is

				        // unknown (i.e. read-before-write wasn't necessary and it wasn't

				        // forced by a configuration option), it won't be recorded on the

				        // histogram.

				        utils::estimated_histogram delete_item_op_size_kb{30};

				        // A size is the maximum of existing item's size and the estimated size

				        // of the update. This will be changed to the maximum of the existing item's

				        // size and the new item's size in a subsequent PR.

				        utils::estimated_histogram update_item_op_size_kb{30};

				        // A size is the sum of the sizes of all items per table. This means

				        // that a single BatchGetItem / BatchWriteItem updates the histogram

				        // for each table that it has items in.

				        // The sizes are the retrieved items' sizes grouped per table.

				        utils::estimated_histogram batch_get_item_op_size_kb{30};

				        // The sizes are the written items' sizes grouped per table.

				        utils::estimated_histogram batch_write_item_op_size_kb{30};

				    } operation_sizes;

				    // Count of authentication and authorization failures, counted if either

				    // alternator_enforce_authorization or alternator_warn_authorization are

				    // set to true. If both are false, no authentication or authorization

				    // checks are performed, so failures are not recognized or counted.

				    // "authentication" failure means the request was not signed with a valid

				    // user and key combination. "authorization" failure means the request was

				    // authenticated to a valid user - but this user did not have permissions

				    // to perform the operation (considering RBAC settings and the user's

				    // superuser status).

				    uint64_t authentication_failures = 0;

				    uint64_t authorization_failures = 0;

				    // Miscellaneous event counters

				    uint64_t total_operations = 0;

				    uint64_t unsupported_operations = 0;

				@@ -84,7 +124,7 @@ public:

				    uint64_t shard_bounce_for_lwt = 0;

				    uint64_t requests_blocked_memory = 0;

				    uint64_t requests_shed = 0;

				    uint64_t rcu_total = 0;

				    uint64_t rcu_half_units_total = 0;

				    // WCU can result from put, update, delete and index

				    // Index related will be done on top of the operation it comes with

				    enum wcu_types {

				@@ -98,10 +138,33 @@ public:

				    uint64_t wcu_total[NUM_TYPES] = {0};

				    // CQL-derived stats

				    cql3::cql_stats cql_stats;

				private:

				    // The metric_groups object holds this stat object's metrics registered

				    // as long as the stats object is alive.

				    seastar::metrics::metric_groups _metrics;

				    // Enumeration of expression types only for stats

				    // if needed it can be extended e.g. per operation 

				    enum expression_types {

				        UPDATE_EXPRESSION,

				        CONDITION_EXPRESSION,

				        PROJECTION_EXPRESSION,

				        NUM_EXPRESSION_TYPES

				    };

				    struct {

				        struct {

				            uint64_t hits = 0;

				            uint64_t misses = 0;

				        } requests[NUM_EXPRESSION_TYPES];

				        uint64_t evictions = 0;

				    } expression_cache;

				};

				struct table_stats {

				    table_stats(const sstring& ks, const sstring& table);

				    seastar::metrics::metric_groups _metrics;

				    lw_shared_ptr<stats> _stats;

				};

				void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);

				inline uint64_t bytes_to_kb_ceil(uint64_t bytes) {

				    return (bytes + 1023) / 1024;

				}

				}
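
Note on the two unit conversions in this change: bytes_to_kb_ceil() rounds a byte count up to whole kilobytes, and the rcu_total metric in stats.cc is reported as 0.5 times the rcu_half_units_total counter. The sketch below reproduces bytes_to_kb_ceil() as it appears in the diff and adds rcu_from_half_units(), a hypothetical helper name used here only to illustrate the half-unit arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Same body as in the diff: round a byte count up to whole kilobytes
// (units of 1024 bytes). Zero bytes stays at zero.
inline uint64_t bytes_to_kb_ceil(uint64_t bytes) {
    return (bytes + 1023) / 1024;
}

// Hypothetical helper (not part of the change): the counter stores
// half units, so the reported value is half of the stored value.
inline double rcu_from_half_units(uint64_t half_units) {
    return 0.5 * static_cast<double>(half_units);
}
```

For example, 1024 bytes maps to 1 KB and 1025 bytes maps to 2 KB, while a stored counter of 3 half units is reported as 1.5 read units.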

									
										69

alternator/streams.cc
									
												
				@@ -13,7 +13,6 @@

				#include <seastar/json/formatter.hh>

				#include "auth/permission.hh"

				#include "db/config.hh"

				#include "cdc/log.hh"

				@@ -32,6 +31,7 @@

				#include "executor.hh"

				#include "data_dictionary/data_dictionary.hh"

				#include "utils/rjson.hh"

				/**

				 * Base template type to implement  rapidjson::internal::TypeHelper<...>:s

				@@ -126,7 +126,7 @@ public:

				    }

				};

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_arn>

				@@ -217,7 +217,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str

				        rjson::add(ret, "LastEvaluatedStreamArn", *last);

				    }

				    return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				    return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				}

				struct shard_id {

				@@ -296,7 +296,7 @@ sequence_number::sequence_number(std::string_view v)

				    }())

				{}

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::shard_id>

				@@ -356,7 +356,7 @@ static stream_view_type cdc_options_to_steam_view_type(const cdc::options& opts)

				    return type;

				}

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_view_type>

				@@ -475,10 +475,10 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				        } else {

				            status = "ENABLED";

				        }

				    } 

				    }

				    auto ttl = std::chrono::seconds(opts.ttl());

				    rjson::add(stream_desc, "StreamStatus", rjson::from_string(status));

				    stream_view_type type = cdc_options_to_steam_view_type(opts);

				@@ -491,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    if (!opts.enabled()) {

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				        return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				    }

				    // TODO: label

				@@ -617,7 +617,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				        rjson::add(stream_desc, "Shards", std::move(shards));

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				        return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				    });

				}

				@@ -714,7 +714,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    auto type = rjson::get<shard_iterator_type>(request, "ShardIteratorType");

				    auto seq_num = rjson::get_opt<sequence_number>(request, "SequenceNumber");

				    if (type < shard_iterator_type::TRIM_HORIZON && !seq_num) {

				        throw api_error::validation("Missing required parameter \"SequenceNumber\"");

				    }

				@@ -724,7 +724,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");

				    auto db = _proxy.data_dictionary();

				    schema_ptr schema = nullptr;

				    std::optional<shard_id> sid;

				@@ -770,7 +770,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    auto ret = rjson::empty_object();

				    rjson::add(ret, "ShardIterator", iter);

				    return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				    return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				}

				struct event_id {

				@@ -789,7 +789,7 @@ struct event_id {

				        return os;

				    }

				};

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::event_id>

				@@ -808,6 +808,9 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    if (limit < 1) {

				        throw api_error::validation("Limit must be 1 or more");

				    }

				    if (limit > 1000) {

				        throw api_error::validation("Limit must be less than or equal to 1000");

				    }

				    auto db = _proxy.data_dictionary();

				    schema_ptr schema, base;

				@@ -824,7 +827,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());

				    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);

				    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);

				    db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;

				    partition_key pk = iter.shard.id.to_partition_key(*schema);

				@@ -868,10 +871,12 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    std::transform(pks.begin(), pks.end(), std::back_inserter(columns), [](auto& c) { return &c; });

				    std::transform(cks.begin(), cks.end(), std::back_inserter(columns), [](auto& c) { return &c; });

				    auto regular_column_start_idx = columns.size();

				    auto regular_column_filter = std::views::filter([](const column_definition& cdef) { return cdef.name() == op_column_name || cdef.name() == eor_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); });

				    std::ranges::transform(schema->regular_columns() | regular_column_filter, std::back_inserter(columns), [](auto& c) { return &c; });

				    auto regular_columns = schema->regular_columns()

				        | std::views::filter([](const column_definition& cdef) { return cdef.name() == op_column_name || cdef.name() == eor_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); })

				        | std::views::transform([&] (const column_definition& cdef) { columns.emplace_back(&cdef); return cdef.id; })

				    auto regular_columns = std::ranges::subrange(columns.begin() + regular_column_start_idx, columns.end())

				        | std::views::transform(&column_definition::id)

				        | std::ranges::to<query::column_id_vector>()

				    ;

				@@ -922,6 +927,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				        std::optional<utils::UUID> timestamp;

				        auto dynamodb = rjson::empty_object();

				        auto record = rjson::empty_object();

				        const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();

				        using op_utype = std::underlying_type_t<cdc::operation>;

				@@ -931,9 +937,10 @@ future<executor::request_return_type> executor::get_records(client_state& client

				                dynamodb = rjson::empty_object();

				            }

				            if (!record.ObjectEmpty()) {

				                // TODO: awsRegion?

				                rjson::add(record, "awsRegion", rjson::from_string(dc_name));

				                rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));

				                rjson::add(record, "eventSource", "scylladb:alternator");

				                rjson::add(record, "eventVersion", "1.1");

				                rjson::push_back(records, std::move(record));

				                record = rjson::empty_object();

				                --limit;

				@@ -952,7 +959,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				                rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());

				                rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));

				                rjson::add(dynamodb, "StreamViewType", type);

				                //TODO: SizeInBytes

				                // TODO: SizeBytes

				            }

				            /**

				@@ -992,6 +999,16 @@ future<executor::request_return_type> executor::get_records(client_state& client

				            case cdc::operation::insert:

				                rjson::add(record, "eventName", "INSERT");

				                break;

				            case cdc::operation::service_row_delete:

				            case cdc::operation::service_partition_delete:

				            {

				                auto user_identity = rjson::empty_object();

				                rjson::add(user_identity, "Type", "Service");

				                rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");

				                rjson::add(record, "userIdentity", std::move(user_identity));

				                rjson::add(record, "eventName", "REMOVE");

				                break;

				            }

				            default:

				                rjson::add(record, "eventName", "REMOVE");

				                break;

				@@ -1018,7 +1035,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				            // will notice end end of shard and not return NextShardIterator.

				            rjson::add(ret, "NextShardIterator", next_iter);

				            _stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);

				            return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				            return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				        }

				        // ugh. figure out if we are and end-of-shard

				@@ -1044,21 +1061,19 @@ future<executor::request_return_type> executor::get_records(client_state& client

				            if (is_big(ret)) {

				                return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));

				            }

				            return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				            return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				        });

				    });

				}

				void executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {

				bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {

				    auto stream_enabled = rjson::find(stream_specification, "StreamEnabled");

				    if (!stream_enabled || !stream_enabled->IsBool()) {

				        throw api_error::validation("StreamSpecification needs boolean StreamEnabled");

				    }

				    if (stream_enabled->GetBool()) {

				        auto db = sp.data_dictionary();

				        if (!db.features().alternator_streams) {

				        if (!sp.features().alternator_streams) {

				            throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");

				        }

				@@ -1083,10 +1098,12 @@ void executor::add_stream_options(const rjson::value& stream_specification, sche

				                break;

				        }

				        builder.with_cdc_options(opts);

				        return true;

				    } else {

				        cdc::options opts;

				        opts.enabled(false);

				        builder.with_cdc_options(opts);

				        return false;

				    }

				}

				@@ -1115,4 +1132,4 @@ void executor::supplement_table_stream_info(rjson::value& descr, const schema& s

				    }

				}

				}

				} // namespace alternator

									
										154

alternator/ttl.cc
									
				@@ -16,8 +16,8 @@

				#include <seastar/core/future.hh>

				#include <seastar/core/lowres_clock.hh>

				#include <seastar/coroutine/maybe_yield.hh>

				#include <boost/multiprecision/cpp_int.hpp>

				#include "cdc/log.hh"

				#include "exceptions/exceptions.hh"

				#include "gms/gossiper.hh"

				#include "gms/inet_address.hh"

				@@ -28,7 +28,7 @@

				#include "replica/database.hh"

				#include "service/client_state.hh"

				#include "service_permit.hh"

				#include "timestamp.hh"

				#include "mutation/timestamp.hh"

				#include "service/storage_proxy.hh"

				#include "service/pager/paging_state.hh"

				#include "service/pager/query_pagers.hh"

				@@ -49,6 +49,7 @@

				#include "dht/sharder.hh"

				#include "db/config.hh"

				#include "db/tags/utils.hh"

				#include "utils/labels.hh"

				#include "ttl.hh"

				@@ -56,18 +57,18 @@ static logging::logger tlogger("alternator_ttl");

				namespace alternator {

				// We write the expiration-time attribute enabled on a table using a

				// We write the expiration-time attribute enabled on a table in a

				// tag TTL_TAG_KEY.

				// Currently, the *value* of this tag is simply the name of the attribute,

				// and the expiration scanner interprets it as an Alternator attribute name -

				// It can refer to a real column or if that doesn't exist, to a member of

				// the ":attrs" map column. Although this is designed for Alternator, it may

				// be good enough for CQL as well (there, the ":attrs" column won't exist).

				static const sstring TTL_TAG_KEY("system:ttl_attribute");

				extern const sstring TTL_TAG_KEY;

				future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				    _stats.api_operations.update_time_to_live++;

				    if (!_proxy.data_dictionary().features().alternator_ttl) {

				    if (!_proxy.features().alternator_ttl) {

				        co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");

				    }

				@@ -81,11 +82,6 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				        co_return api_error::validation("UpdateTimeToLive requires boolean Enabled");

				    }

				    bool enabled = v->GetBool();

				    // Alternator TTL doesn't yet work when the table uses tablets (#16567)

				    if (enabled && _proxy.local_db().find_keyspace(schema->ks_name()).get_replication_strategy().uses_tablets()) {

				        co_return api_error::validation("TTL not yet supported on a table using tablets (issue #16567). "

				            "Create a table with the tag 'experimental:initial_tablets' set to 'none' to use vnodes.");

				    }

				    v = rjson::find(*spec, "AttributeName");

				    if (!v || !v->IsString()) {

				        co_return api_error::validation("UpdateTimeToLive requires string AttributeName");

				@@ -99,7 +95,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				    }

				    sstring attribute_name(v->GetString(), v->GetStringLength());

				    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);

				    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);

				    co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {

				        if (enabled) {

				            if (tags_map.contains(TTL_TAG_KEY)) {

				@@ -123,7 +119,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				    // basically identical to the request's

				    rjson::value response = rjson::empty_object();

				    rjson::add(response, "TimeToLiveSpecification", std::move(*spec));

				    co_return make_jsonable(std::move(response));

				    co_return rjson::print(std::move(response));

				}

				future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				@@ -140,7 +136,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta

				    }

				    rjson::value response = rjson::empty_object();

				    rjson::add(response, "TimeToLiveDescription", std::move(desc));

				    co_return make_jsonable(std::move(response));

				    co_return rjson::print(std::move(response));

				}

				// expiration_service is a sharded service responsible for cleaning up expired

				@@ -291,13 +287,18 @@ static future<> expire_item(service::storage_proxy& proxy,

				        auto ck = clustering_key::from_exploded(exploded_ck);

				        m.partition().clustered_row(*schema, ck).apply(tombstone(ts, gc_clock::now()));

				    }

				    std::vector<mutation> mutations;

				    utils::chunked_vector<mutation> mutations;

				    mutations.push_back(std::move(m));

				    return proxy.mutate(std::move(mutations),

				        db::consistency_level::LOCAL_QUORUM,

				        executor::default_timeout(), // FIXME - which timeout?

				        qs.get_trace_state(), qs.get_permit(),

				        db::allow_per_partition_rate_limit::no);

				        db::allow_per_partition_rate_limit::no,

				        false,

				        cdc::per_request_options{

				            .is_system_originated = true,

				        }

				    );

				}

				static size_t random_offset(size_t min, size_t max) {

				@@ -315,8 +316,10 @@ static size_t random_offset(size_t min, size_t max) {

				// this range's primary node is down. For this we need to return not just

				// a list of this node's secondary ranges - but also the primary owner of

				// each of those ranges.

				//

				// The function is to be used with vnodes only

				static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_secondary_ranges(

				        const locator::effective_replication_map_ptr& erm,

				        const locator::effective_replication_map* erm,

				        locator::host_id ep) {

				    const auto& tm = *erm->get_token_metadata_ptr();

				    const auto& sorted_tokens = tm.sorted_tokens();

				@@ -327,6 +330,7 @@ static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_se

				    auto prev_tok = sorted_tokens.back();

				    for (const auto& tok : sorted_tokens) {

				        co_await coroutine::maybe_yield();

				        // FIXME: pass is_vnode=true to get_natural_replicas since the token is in tm.sorted_tokens()

				        host_id_vector_replica_set eps = erm->get_natural_replicas(tok);

				        if (eps.size() <= 1 || eps[1] != ep) {

				            prev_tok = tok;

				@@ -396,7 +400,7 @@ class ranges_holder_primary {

				    dht::token_range_vector _token_ranges;

				public:

				    explicit ranges_holder_primary(dht::token_range_vector token_ranges) : _token_ranges(std::move(token_ranges)) {}

				    static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map_ptr& erm, locator::host_id ep) {

				    static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map* erm, locator::host_id ep) {

				        co_return ranges_holder_primary(co_await erm->get_primary_ranges(ep));

				    }

				    std::size_t size() const { return _token_ranges.size(); }

				@@ -416,7 +420,7 @@ public:

				    explicit ranges_holder_secondary(std::vector<std::pair<dht::token_range, locator::host_id>> token_ranges, const gms::gossiper& g)

				        : _token_ranges(std::move(token_ranges))

				        , _gossiper(g) {}

				    static future<ranges_holder_secondary> make(const locator::effective_replication_map_ptr& erm, locator::host_id ep, const gms::gossiper& g) {

				    static future<ranges_holder_secondary> make(const locator::vnode_effective_replication_map* erm, locator::host_id ep, const gms::gossiper& g) {

				        co_return ranges_holder_secondary(co_await get_secondary_ranges(erm, ep), g);

				    }

				    std::size_t size() const { return _token_ranges.size(); }

				@@ -429,6 +433,8 @@ public:

				    }

				};

				// The token_ranges_owned_by_this_shard class is only used for vnodes, where the vnodes give a partition range for the entire node

				// and such range still needs to be divided between the shards.

				template<class primary_or_secondary_t>

				class token_ranges_owned_by_this_shard {

				    schema_ptr _s;

				@@ -522,7 +528,7 @@ struct scan_ranges_context {

				        // should be possible (and a must for issue #7751!).

				        lw_shared_ptr<service::pager::paging_state> paging_state = nullptr;

				        auto regular_columns =

				            s->regular_columns() | std::views::transform([] (const column_definition& cdef) { return cdef.id; })

				            s->regular_columns() | std::views::transform(&column_definition::id)

				            | std::ranges::to<query::column_id_vector>();

				        selection = cql3::selection::selection::wildcard(s);

				        query::partition_slice::option_set opts = selection->get_query_options();

				@@ -655,6 +661,17 @@ static future<> scan_table_ranges(

				    }

				}

				static future<> scan_tablet(locator::tablet_id tablet, service::storage_proxy& proxy, abort_source& abort_source, named_semaphore& page_sem,

				            expiration_service::stats& expiration_stats, const scan_ranges_context& scan_ctx, const locator::tablet_map& tablet_map) {

				    auto tablet_token_range = tablet_map.get_token_range(tablet);

				    dht::ring_position tablet_start(tablet_token_range.start()->value(), dht::ring_position::token_bound::start),

				                       tablet_end(tablet_token_range.end()->value(), dht::ring_position::token_bound::end);

				    auto partition_range = dht::partition_range::make(std::move(tablet_start), std::move(tablet_end));

				    // Note that because of issue #9167 we need to run a separate query on each partition range, and can't pass

				    // several of them into one partition_range_vector that is passed to scan_table_ranges().

				    return scan_table_ranges(proxy, scan_ctx, {partition_range}, abort_source, page_sem, expiration_stats);

				}

				// scan_table() scans, in one table, data "owned" by this shard, looking for

				// expired items and deleting them.

				// We consider each node to "own" its primary token ranges, i.e., the tokens

				@@ -730,34 +747,69 @@ static future<bool> scan_table(

				    expiration_stats.scan_table++;

				    // FIXME: need to pace the scan, not do it all at once.

				    scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};

				    auto erm = db.real_database().find_keyspace(s->ks_name()).get_vnode_effective_replication_map();

				    auto my_host_id = erm->get_topology().my_host_id();

				    token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_host_id));

				    while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {

				        // Note that because of issue #9167 we need to run a separate

				        // query on each partition range, and can't pass several of

				        // them into one partition_range_vector.

				        dht::partition_range_vector partition_ranges;

				        partition_ranges.push_back(std::move(*range));

				        // FIXME: if scanning a single range fails, including network errors,

				        // we fail the entire scan (and rescan from the beginning). Need to

				        // reconsider this. Saving the scan position might be a good enough

				        // solution for this problem.

				        co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				    }

				    // If each node only scans its own primary ranges, then when any node is

				    // down part of the token range will not get scanned. This can be viewed

				    // as acceptable (when the comes back online, it will resume its scan),

				    // but as noted in issue #9787, we can allow more prompt expiration

				    // by tasking another node to take over scanning of the dead node's primary

				    // ranges. What we do here is that this node will also check expiration

				    // on its *secondary* ranges - but only those whose primary owner is down.

				    token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_host_id, gossiper));

				    while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {

				        expiration_stats.secondary_ranges_scanned++;

				        dht::partition_range_vector partition_ranges;

				        partition_ranges.push_back(std::move(*range));

				        co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				    if (s->table().uses_tablets()) {

				        locator::effective_replication_map_ptr erm = s->table().get_effective_replication_map();

				        auto my_host_id = erm->get_topology().my_host_id();

				        const auto &tablet_map = erm->get_token_metadata().tablets().get_tablet_map(s->id());

				        for (std::optional tablet = tablet_map.first_tablet(); tablet; tablet = tablet_map.next_tablet(*tablet)) {

				            auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet, erm->get_topology());

				            // check if this is the primary replica for the current tablet

				            if (tablet_primary_replica.host == my_host_id && tablet_primary_replica.shard == this_shard_id()) {

				                co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

				            } else if(erm->get_replication_factor() > 1) {

				                // Check if this is the secondary replica for the current tablet

				                // and if the primary replica is down which means we will take over this work.

				                // If each node only scans its own primary ranges, then when any node is

				                // down part of the token range will not get scanned. This can be viewed

				                // as acceptable (when the comes back online, it will resume its scan),

				                // but as noted in issue #9787, we can allow more prompt expiration

				                // by tasking another node to take over scanning of the dead node's primary

				                // ranges. What we do here is that this node will also check expiration

				                // on its *secondary* ranges - but only those whose primary owner is down.

				                auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica

				                if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {

				                    if (!gossiper.is_alive(tablet_primary_replica.host)) {

				                        co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

				                    }

				                }

				            }

				        }

				    } else {  // VNodes

				        locator::static_effective_replication_map_ptr ermp =

				                db.real_database().find_keyspace(s->ks_name()).get_static_effective_replication_map();

				        auto* erm = ermp->maybe_as_vnode_effective_replication_map();

				        if (!erm) {

				            on_internal_error(tlogger, format("Keyspace {} is local", s->ks_name()));

				        }

				        auto my_host_id = erm->get_topology().my_host_id();

				        token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_host_id));

				        while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {

				            // Note that because of issue #9167 we need to run a separate

				            // query on each partition range, and can't pass several of

				            // them into one partition_range_vector.

				            dht::partition_range_vector partition_ranges;

				            partition_ranges.push_back(std::move(*range));

				            // FIXME: if scanning a single range fails, including network errors,

				            // we fail the entire scan (and rescan from the beginning). Need to

				            // reconsider this. Saving the scan position might be a good enough

				            // solution for this problem.

				            co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				        }

				        // If each node only scans its own primary ranges, then when any node is

				        // down part of the token range will not get scanned. This can be viewed

				        // as acceptable (when the comes back online, it will resume its scan),

				        // but as noted in issue #9787, we can allow more prompt expiration

				        // by tasking another node to take over scanning of the dead node's primary

				        // ranges. What we do here is that this node will also check expiration

				        // on its *secondary* ranges - but only those whose primary owner is down.

				        token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_host_id, gossiper));

				        while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {

				            expiration_stats.secondary_ranges_scanned++;

				            dht::partition_range_vector partition_ranges;

				            partition_ranges.push_back(std::move(*range));

				            co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				        }

				    }

				    co_return true;

				}
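The tablet branch of scan_table() above decides, per tablet, whether this shard scans it: always when it is the primary replica, and as a takeover when it is the secondary replica and the primary's host is reported down. A minimal, self-contained sketch of that decision follows. The `replica` struct, the `scan_role` enum and `decide_scan_role()` are hypothetical names introduced only for illustration and are not ScyllaDB types or APIs. The sketch also takes the secondary replica as a plain argument, whereas the diff only looks it up when the replication factor is greater than 1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a tablet replica (host + shard); not a ScyllaDB type.
struct replica {
    std::uint64_t host;
    unsigned shard;
};

inline bool same_replica(const replica& a, const replica& b) {
    return a.host == b.host && a.shard == b.shard;
}

// Hypothetical result type: what, if anything, this shard should do for a tablet.
enum class scan_role { none, primary, secondary_takeover };

// Mirrors the branch structure in the diff under stated assumptions:
//  - the primary replica always scans its tablet;
//  - otherwise, with replication factor > 1, the secondary replica scans
//    only while the primary's host is not alive.
inline scan_role decide_scan_role(const replica& me,
                                  const replica& primary,
                                  const replica& secondary,
                                  unsigned replication_factor,
                                  bool primary_alive) {
    if (same_replica(primary, me)) {
        return scan_role::primary;
    }
    if (replication_factor > 1 && same_replica(secondary, me) && !primary_alive) {
        return scan_role::secondary_takeover;
    }
    return scan_role::none;
}
```

Keeping the decision in a pure function like this makes the takeover rule easy to unit-test without a running cluster; in the actual change the liveness input comes from gossiper.is_alive() on the primary replica's host.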

				@@ -851,13 +903,13 @@ future<> expiration_service::stop() {

				expiration_service::stats::stats() {

				    _metrics.add_group("expiration", {

				        seastar::metrics::make_total_operations("scan_passes", scan_passes,

				            seastar::metrics::description("number of passes over the database")),

				            seastar::metrics::description("number of passes over the database"))(alternator_label).set_skip_when_empty(),

				        seastar::metrics::make_total_operations("scan_table", scan_table,

				            seastar::metrics::description("number of table scans (counting each scan of each table that enabled expiration)")),

				            seastar::metrics::description("number of table scans (counting each scan of each table that enabled expiration)"))(alternator_label).set_skip_when_empty(),

				        seastar::metrics::make_total_operations("items_deleted", items_deleted,

				            seastar::metrics::description("number of items deleted after expiration")),

				            seastar::metrics::description("number of items deleted after expiration"))(basic_level)(alternator_label).set_skip_when_empty(),

				        seastar::metrics::make_total_operations("secondary_ranges_scanned", secondary_ranges_scanned,

				            seastar::metrics::description("number of token ranges scanned by this node while their primary owner was down")),

				            seastar::metrics::description("number of token ranges scanned by this node while their primary owner was down"))(alternator_label).set_skip_when_empty(),

				    });

				}

									
										5

api/CMakeLists.txt
									
				@@ -42,6 +42,7 @@ set(swagger_files

				  api-doc/messaging_service.json

				  api-doc/metrics.json

				  api-doc/raft.json

				  api-doc/service_levels.json

				  api-doc/storage_proxy.json

				  api-doc/storage_service.json

				  api-doc/stream_manager.json

				@@ -82,6 +83,7 @@ target_sources(api

				    lsa.cc

				    messaging_service.cc

				    raft.cc

				    service_levels.cc

				    storage_proxy.cc

				    storage_service.cc

				    stream_manager.cc

				@@ -104,5 +106,8 @@ target_link_libraries(api

				    wasmtime_bindings

				    absl::headers)

				if (Scylla_USE_PRECOMPILED_HEADER_USE)

				  target_precompile_headers(api REUSE_FROM scylla-precompiled-header)

				endif()

				check_headers(check-headers api

				  GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

									
										58

api/api-doc/compaction_manager.json
									
				@@ -246,6 +246,24 @@

				            }

				         }

				      },

				      "sstableinfo":{

				         "id":"sstableinfo",

				         "description":"Compacted sstable information",

				         "properties":{

				            "generation":{

				               "type": "string",

				               "description":"Generation of the sstable"

				            },

				            "origin":{

				               "type":"string",

				               "description":"Origin of the sstable"

				            },

				            "size":{

				               "type":"long",

				               "description":"Size of the sstable"

				            }

				         }

				      },

				      "compaction_info" :{

				          "id": "compaction_info",

				          "description":"A key value mapping",

				@@ -327,6 +345,10 @@

				               "type":"string",

				               "description":"The UUID"

				            },

				            "shard_id":{

				               "type":"int",

				               "description":"The shard id the compaction was executed on"

				            },

				            "cf":{

				               "type":"string",

				               "description":"The column family name"

				@@ -335,9 +357,17 @@

				               "type":"string",

				               "description":"The keyspace name"

				            },

				            "compaction_type":{

				               "type":"string",

				               "description":"Type of compaction"

				            },

				            "started_at":{

				               "type":"long",

				               "description":"The time compaction started"

				            },

				            "compacted_at":{

				               "type":"long",

				               "description":"The time of compaction"

				               "description":"The time compaction completed"

				            },

				            "bytes_in":{

				               "type":"long",

				@@ -353,6 +383,32 @@

				                  "type":"row_merged"

				               },

				               "description":"The merged rows"

				            },

				            "sstables_in": {

				               "type":"array",

				               "items":{

				                  "type":"sstableinfo"

				               },

				               "description":"List of input sstables for compaction"

				            },

				            "sstables_out": {

				               "type":"array",

				               "items":{

				                  "type":"sstableinfo"

				               },

				               "description":"List of output sstables from compaction"

				            },

				            "total_tombstone_purge_attempt":{

				               "type":"long",

				               "description":"Total number of tombstone purge attempts"

				            },

				            "total_tombstone_purge_failure_due_to_overlapping_with_memtable":{

				               "type":"long",

				               "description":"Number of tombstone purge failures due to data overlapping with memtables"

				            },

				            "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable":{

				               "type":"long",

				               "description":"Number of tombstone purge failures due to data overlapping with non-compacting sstables"

				            }

				        }

				      }

									
										8

api/api-doc/gossiper.json
									
												
				@@ -136,14 +136,6 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"unsafe",

				                     "description":"Set to True to perform an unsafe assassination",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

									
										56

api/api-doc/service_levels.json
									
										Normal file
									
												
				@@ -0,0 +1,56 @@

				{

				    "apiVersion":"0.0.1",

				    "swaggerVersion":"1.2",

				    "basePath":"{{Protocol}}://{{Host}}",

				    "resourcePath":"/service_levels",

				    "produces":[

				        "application/json"

				    ],

				    "apis":[

				        {

				            "path":"/service_levels/switch_tenants",

				            "operations":[

				                {

				                    "method":"POST",

				                    "summary":"Switch tenants on all opened connections if needed",

				                    "type":"void",

				                    "nickname":"do_switch_tenants",

				                    "produces":[

				                        "application/json"

				                    ],

				                    "parameters":[]

				                }

				            ]

				        },

				        {

				            "path":"/service_levels/count_connections",

				            "operations":[

				                {

				                    "method":"GET",

				                    "summary":"Count opened CQL connections per scheduling group per user",

				                    "type":"connections_count_map",

				                    "nickname":"count_connections",

				                    "produces":[

				                        "application/json"

				                    ],

				                    "parameters":[]

				                }

				            ]

				        }

				    ],

				    "models":{},

				    "components": {

				        "schemas": {

				          "connections_count_map": {

				            "type": "object",

				            "additionalProperties": {

				              "type": "object",

				              "additionalProperties": {

				                "type": "integer"

				              }

				            }

				          }

				        }

				      }

				}
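
A minimal client-side sketch of how a `connections_count_map` response from `/service_levels/count_connections` might be consumed. It assumes the outer keys are scheduling group names and the inner keys are user names, following the order in the endpoint summary. The sample group and user names are hypothetical, not taken from the source.

```python
def totals_per_scheduling_group(counts):
    """Sum the per-user connection counts for each scheduling group.

    `counts` is assumed to look like {scheduling_group: {user: int}},
    matching the nested additionalProperties in the schema above.
    """
    return {group: sum(per_user.values()) for group, per_user in counts.items()}


# Hypothetical response body, already decoded from JSON.
sample = {
    "sl:default": {"alice": 3, "bob": 1},
    "sl:analytics": {"carol": 2},
}

for group, total in totals_per_scheduling_group(sample).items():
    print(group, total)
```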

									
										393

api/api-doc/storage_service.json
									
												
				@@ -220,6 +220,25 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/nodes/excluded",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Retrieve host ids of nodes which are marked as excluded",

				               "type":"array",

				               "items":{

				                  "type":"string"

				               },

				               "nickname":"get_excluded_nodes",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/nodes/joining",

				         "operations":[

				@@ -594,6 +613,50 @@

				            }

				         ]

				      },

				      {

				         "path": "/storage_service/natural_endpoints/v2/{keyspace}",

				         "operations": [

				            {

				               "method": "GET",

				               "summary":"This method returns the N endpoints that are responsible for storing the specified key i.e for replication. the endpoint responsible for this key",

				               "type": "array",

				               "items": {

				                  "type": "string"

				               },

				               "nickname": "get_natural_endpoints_v2",

				               "produces": [

				                  "application/json"

				               ],

				               "parameters": [

				                  {

				                     "name": "keyspace",

				                     "description": "The keyspace to query about.",

				                     "required": true,

				                     "allowMultiple": false,

				                     "type": "string",

				                     "paramType": "path"

				                  },

				                  {

				                     "name": "cf",

				                     "description": "Column family name.",

				                     "required": true,

				                     "allowMultiple": false,

				                     "type": "string",

				                     "paramType": "query"

				                  },

				                  {

				                     "name": "key_component",

				                     "description": "Each component of the key for which we need to find the endpoint (e.g. ?key_component=part1&key_component=part2).",

				                     "required": true,

				                     "allowMultiple": true,

				                     "type": "string",

				                     "paramType": "query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/cdc_streams_check_and_repair",

				         "operations":[

				@@ -813,6 +876,14 @@

				                          "allowMultiple":false,

				                          "type":"string",

				                          "paramType":"query"

				                      },

				                      {

				                          "name":"move_files",

				                          "description":"Move component files instead of copying them",

				                          "required":false,

				                          "allowMultiple":false,

				                          "type":"boolean",

				                          "paramType":"query"

				                      }

				                  ]

				              }

				@@ -881,6 +952,23 @@

				                          "allowMultiple":false,

				                          "type":"string",

				                          "paramType":"query"

				                      },

				                      {

				                          "name":"scope",

				                          "description":"Defines the set of nodes to which mutations can be streamed",

				                          "required":false,

				                          "allowMultiple":false,

				                          "type":"string",

				                          "paramType":"query",

				                          "enum": ["all", "dc", "rack", "node"]

				                      },

				                      {

				                         "name":"primary_replica_only",

				                         "description":"Load the sstables and stream to the primary replica node within the scope, if one is specified. If not, stream to the global primary replica.",

				                         "required":false,

				                         "allowMultiple":false,

				                         "type":"boolean",

				                         "paramType":"query"

				                      }

				                  ]

				              }

				@@ -967,7 +1055,7 @@

				         ]

				      },

				      {

				         "path":"/storage_service/cleanup_all",

				         "path":"/storage_service/cleanup_all/",

				         "operations":[

				            {

				               "method":"POST",

				@@ -977,6 +1065,30 @@

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                    {

				                     "name":"global",

				                     "description":"true if cleanup of entire cluster is requested",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/mark_node_as_clean",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",

				               "type":"void",

				               "nickname":"reset_cleanup_needed",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[]

				            }

				         ]

				@@ -1083,6 +1195,14 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name": "drop_unfixable_sstables",

				                     "description": "When set to true, drop unfixable sstables. Applies only to scrub mode SEGREGATE.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				@@ -1502,6 +1622,30 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/exclude_node",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Marks the node as permanently down (excluded).",

				               "type":"void",

				               "nickname":"exclude_node",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"hosts",

				                     "description":"Comma-separated list of host ids to exclude",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/removal_status",

				         "operations":[

				@@ -1639,38 +1783,6 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/truncate/{keyspace}",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Truncates (deletes) the given columnFamily from the provided keyspace. Calling truncate results in actual deletion of all data in the cluster under the given columnFamily and it will fail unless all hosts are up. All data in the given column family will be deleted, but its definition will not be affected.",

				               "type":"void",

				               "nickname":"truncate",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"The keyspace",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Column family name",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/keyspaces",

				         "operations":[

				@@ -2159,6 +2271,31 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"skip_cleanup",

				                     "description":"Don't cleanup keys from loaded sstables. Invalid if load_and_stream is true",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"skip_reshape",

				                     "description":"Don't reshape the loaded sstables. Invalid if load_and_stream is true",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"scope",

				                     "description":"Defines the set of nodes to which mutations can be streamed",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query",

				                     "enum": ["all", "dc", "rack", "node"]

				                  }

				               ]

				            }

				@@ -2859,7 +2996,7 @@

				               "nickname":"repair_tablet",

				               "method":"POST",

				               "summary":"Repair a tablet",

				               "type":"void",

				               "type":"tablet_repair_result",

				               "produces":[

				                  "application/json"

				               ],

				@@ -2887,6 +3024,38 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"hosts_filter",

				                     "description":"Repair replicas listed in the comma-separated host_id list.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"dcs_filter",

				                     "description":"Repair replicas listed in the comma-separated DC list",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"await_completion",

				                     "description":"Set true to wait for the repair to complete. Set false to skip waiting for the repair to complete. When the option is not provided, it defaults to false.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"incremental_mode",

				                     "description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				@@ -3018,6 +3187,73 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/retrain_dict",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Retrain the SSTable compression dictionary for the target table.",

				               "type":"void",

				               "nickname":"retrain_dict",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"Name of the keyspace containing the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Name of the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/estimate_compression_ratios",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Compute an estimated compression ratio for SSTables of the given table, for various compression configurations.",

				               "type":"array",

				               "items":{

				                  "type":"compression_config_result"

				               },

				               "nickname":"estimate_compression_ratios",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"Name of the keyspace containing the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Name of the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/raft_topology/reload",

				         "operations":[

				@@ -3060,6 +3296,54 @@

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/raft_topology/cmd_rpc_status",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get information about currently running topology cmd rpc",

				               "type":"string",

				               "nickname":"raft_topology_get_cmd_status",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/drop_quarantined_sstables",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Drops all quarantined sstables in all keyspaces or specified keyspace and tables",

				               "type":"void",

				               "nickname":"drop_quarantined_sstables",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"The keyspace name to drop quarantined sstables from.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"tables",

				                     "description":"Comma-separated table names to drop quarantined sstables from.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      }

				   ],

				   "models":{

				@@ -3196,11 +3480,11 @@

				         "properties":{

				            "start_token":{

				               "type":"string",

				               "description":"The range start token"

				               "description":"The range start token (exclusive)"

				            },

				            "end_token":{

				               "type":"string",

				               "description":"The range start token"

				               "description":"The range end token (inclusive)"

				            },

				            "endpoints":{

				               "type":"array",

				@@ -3273,7 +3557,7 @@

				            "version":{

				               "type":"string",

				               "enum":[

				                  "ka", "la", "mc", "md", "me"

				                  "ka", "la", "mc", "md", "me", "ms"

				               ],

				               "description":"SSTable version"

				            },

				@@ -3310,6 +3594,41 @@

				                }

				            }

				        }

				      },

				      "tablet_repair_result":{

				        "id":"tablet_repair_result",

				        "description":"Tablet repair result",

				        "properties":{

				            "tablet_task_id":{

				                "type":"string"

				            }

				        }

				      },

				      "compression_config_result":{

				         "id":"compression_config_result",

				         "description":"Compression ratio estimation result for one config",

				         "properties":{

				            "level":{

				               "type":"long",

				               "description":"The used value of `compression_level`"

				            },

				            "chunk_length_in_kb":{

				               "type":"long",

				               "description":"The used value of `chunk_length_in_kb`"

				            },

				            "dict":{

				               "type":"string",

				               "description":"The used dictionary: `none`, `past` (== current), or `future`"

				            },

				            "sstable_compression":{

				               "type":"string",

				               "description":"The used compressor name (aka `sstable_compression`)"

				            },

				            "ratio":{

				               "type":"float",

				               "description":"The resulting compression ratio (estimated on a random sample of files)"

				            }

				         }

				      }

				   }

				}
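
The `key_component` parameter of `/storage_service/natural_endpoints/v2/{keyspace}` is declared with `allowMultiple: true`, so a client repeats it once per key component. The sketch below shows one way to build such a URL with the Python standard library. The base URL, keyspace, and table names are placeholders chosen for illustration, and the helper name is hypothetical.

```python
from urllib.parse import quote, urlencode


def natural_endpoints_v2_url(base, keyspace, cf, key_components):
    """Build a request URL with one key_component pair per key part."""
    # A list of pairs lets the same parameter name appear several times.
    pairs = [("cf", cf)] + [("key_component", part) for part in key_components]
    path = "/storage_service/natural_endpoints/v2/" + quote(keyspace, safe="")
    return base + path + "?" + urlencode(pairs)


# Placeholder host and names; adjust to the node's REST API address.
print(natural_endpoints_v2_url("http://localhost:10000", "ks", "tbl", ["part1", "part2"]))
```

Passing the components as an ordered list keeps them in the same order as the partition key columns, which is what the parameter description implies.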

									
										44

api/api-doc/task_manager.json
									
												
				@@ -253,6 +253,30 @@

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/task_manager/drain/{module}",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Drain finished local tasks",

				               "type":"void",

				               "nickname":"drain_tasks",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"module",

				                     "description":"The module to drain",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  }

				               ]

				            }

				         ]

				      }

				   ],

				   "models":{

				@@ -284,7 +308,8 @@

				                  "created",

				                  "running",

				                  "done",

				                  "failed"

				                  "failed",

				                  "suspended"

				               ],

				               "description":"The state of a task"

				            },

				@@ -319,6 +344,18 @@

				            "sequence_number":{

				               "type":"long",

				               "description":"The running sequence number of the task"

				            },

				            "shard":{

				               "type":"long",

				               "description":"The shard the task is running on"

				            },

				            "start_time":{

				               "type":"datetime",

				               "description":"The start time of the task; unspecified (equal to epoch) when state == created"

				            },

				            "end_time":{

				               "type":"datetime",

				               "description":"The end time of the task; unspecified (equal to epoch) when the task is not completed"

				            }

				         }

				      },

				@@ -352,7 +389,8 @@

				                  "created",

				                  "running",

				                  "done",

				                  "failed"

				                  "failed",

				                  "suspended"

				               ],

				               "description":"The state of the task"

				            },

				@@ -409,7 +447,7 @@

				               "description":"The number of units completed so far"

				            },

				            "children_ids":{

				               "type":"array",

				               "type":"chunked_array",

				               "items":{

				                  "type":"task_identity"

				               },

									
										8

api/api-doc/tasks.json
									
												
				@@ -42,6 +42,14 @@

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"consider_only_existing_data",

				                     "description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

									
										92

api/api.cc
									
												
				@@ -36,6 +36,7 @@

				#include "tasks.hh"

				#include "raft.hh"

				#include "gms/gossip_address_map.hh"

				#include "service_levels.hh"

				logging::logger apilog("api");

				@@ -80,7 +81,7 @@ future<> set_server_init(http_context& ctx) {

				    });

				}

				future<> set_server_config(http_context& ctx, const db::config& cfg) {

				future<> set_server_config(http_context& ctx, db::config& cfg) {

				    auto rb02 = std::make_shared < api_registry_builder20 > (ctx.api_doc, "/v2");

				    return ctx.http_server.set_routes([&ctx, &cfg, rb02](routes& r) {

				        set_config(rb02, ctx, r, cfg, false);

				@@ -136,14 +137,6 @@ future<> unset_load_meter(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_load_meter(ctx, r); });

				}

				future<> set_format_selector(http_context& ctx, db::sstables_format_selector& sel) {

				    return ctx.http_server.set_routes([&ctx, &sel] (routes& r) { set_format_selector(ctx, r, sel); });

				}

				future<> unset_format_selector(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_format_selector(ctx, r); });

				}

				future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader) {

				    return ctx.http_server.set_routes([&ctx, &sst_loader] (routes& r) { set_sstables_loader(ctx, r, sst_loader); });

				}

				@@ -152,8 +145,8 @@ future<> unset_server_sstables_loader(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_sstables_loader(ctx, r); });

				}

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb) {

				    return ctx.http_server.set_routes([&ctx, &vb] (routes& r) { set_view_builder(ctx, r, vb); });

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {

				    return ctx.http_server.set_routes([&ctx, &vb, &g] (routes& r) { set_view_builder(ctx, r, vb, g); });

				}

				future<> unset_server_view_builder(http_context& ctx) {

				@@ -187,8 +180,8 @@ future<> unset_server_snapshot(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_snapshot(ctx, r); });

				}

				future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm) {

				    return ctx.http_server.set_routes([&ctx, &tm] (routes& r) { set_token_metadata(ctx, r, tm); });

				future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm, sharded<gms::gossiper>& g) {

				    return ctx.http_server.set_routes([&ctx, &tm, &g] (routes& r) { set_token_metadata(ctx, r, tm, g); });

				}

				future<> unset_server_token_metadata(http_context& ctx) {

				@@ -223,15 +216,22 @@ future<> unset_server_gossip(http_context& ctx) {

				    });

				}

				future<> set_server_column_family(http_context& ctx, sharded<db::system_keyspace>& sys_ks) {

				    return register_api(ctx, "column_family",

				                "The column family API", [&sys_ks] (http_context& ctx, routes& r) {

				                    set_column_family(ctx, r, sys_ks);

				future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db) {

				    co_await register_api(ctx, "column_family",

				                "The column family API", [&db] (http_context& ctx, routes& r) {

				                    set_column_family(ctx, r, db);

				                });

				    co_await register_api(ctx, "cache_service",

				            "The cache service API", [&db] (http_context& ctx, routes& r) {

				                    set_cache_service(ctx, db, r);

				                });

				}

				future<> unset_server_column_family(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_column_family(ctx, r); });

				    return ctx.http_server.set_routes([&ctx] (routes& r) {

				        unset_column_family(ctx, r);

				        unset_cache_service(ctx, r);

				    });

				}

				future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms) {

				@@ -263,19 +263,10 @@ future<> unset_server_stream_manager(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_stream_manager(ctx, r); });

				}

				future<> set_server_cache(http_context& ctx) {

				    return register_api(ctx, "cache_service",

				            "The cache service API", set_cache_service);

				}

				future<> unset_server_cache(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_cache_service(ctx, r); });

				}

				future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& proxy) {

				future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& proxy, sharded<gms::gossiper>& g) {

				    return register_api(ctx, "hinted_handoff",

				                "The hinted handoff API", [&proxy] (http_context& ctx, routes& r) {

				                    set_hinted_handoff(ctx, r, proxy);

				                "The hinted handoff API", [&proxy, &g] (http_context& ctx, routes& r) {

				                    set_hinted_handoff(ctx, r, proxy, g);

				                });

				}

				@@ -283,7 +274,7 @@ future<> unset_hinted_handoff(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_hinted_handoff(ctx, r); });

				}

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction_manager>& cm) {

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction::compaction_manager>& cm) {

				    return register_api(ctx, "compaction_manager", "The Compaction manager API", [&cm] (http_context& ctx, routes& r) {

				        set_compaction_manager(ctx, r, cm);

				    });

				@@ -316,13 +307,13 @@ future<> unset_server_commitlog(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_commitlog(ctx, r); });

				}

				future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg) {

				future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg, sharded<gms::gossiper>& gossiper) {

				    auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

				    return ctx.http_server.set_routes([rb, &ctx, &tm, &cfg = *cfg](routes& r) {

				    return ctx.http_server.set_routes([rb, &ctx, &tm, &cfg = *cfg, &gossiper](routes& r) {

				        rb->register_function(r, "task_manager",

				                "The task manager API");

				        set_task_manager(ctx, r, tm, cfg);

				        set_task_manager(ctx, r, tm, cfg, gossiper);

				    });

				}

				@@ -358,6 +349,12 @@ future<> unset_server_cql_server_test(http_context& ctx) {

				#endif

				future<> set_server_service_levels(http_context &ctx, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp) {

				    return register_api(ctx, "service_levels", "The service levels API", [&ctl, &qp] (http_context& ctx, routes& r) {

				        set_service_levels(ctx, r, ctl, qp);

				    });

				}

				future<> set_server_tasks_compaction_module(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {

				    auto rb = std::make_shared < api_registry_builder > (ctx.api_doc);

				@@ -384,32 +381,5 @@ future<> unset_server_raft(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_raft(ctx, r); });

				}

				void req_params::process(const request& req) {

				    // Process mandatory parameters

				    for (auto& [name, ent] : params) {

				        if (!ent.is_mandatory) {

				            continue;

				        }

				        try {

				            ent.value = req.get_path_param(name);

				        } catch (std::out_of_range&) {

				            throw httpd::bad_param_exception(fmt::format("Mandatory parameter '{}' was not provided", name));

				        }

				    }

				    // Process optional parameters

				    for (auto& [name, value] : req.query_parameters) {

				        try {

				            auto& ent = params.at(name);

				            if (ent.is_mandatory) {

				                throw httpd::bad_param_exception(fmt::format("Parameter '{}' is expected to be provided as part of the request url", name));

				            }

				            ent.value = value;

				        } catch (std::out_of_range&) {

				            throw httpd::bad_param_exception(fmt::format("Unsupported optional parameter '{}'", name));

				        }

				    }

				}

				}

									
										91

api/api.hh
									
												
				@@ -23,17 +23,6 @@

				namespace api {

				template<class T>

				std::vector<sstring> container_to_vec(const T& container) {

				    std::vector<sstring> res;

				    res.reserve(std::size(container));

				    for (const auto& i : container) {

				        res.push_back(fmt::to_string(i));

				    }

				    return res;

				}

				template<class T>

				std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {

				    std::vector<T> res;

				@@ -67,17 +56,6 @@ T map_sum(T&& dest, const S& src) {

				    return std::move(dest);

				}

				template <typename MAP>

				std::vector<sstring> map_keys(const MAP& map) {

				    std::vector<sstring> res;

				    res.reserve(std::size(map));

				    for (const auto& i : map) {

				        res.push_back(fmt::to_string(i.first));

				    }

				    return res;

				}

				/**

				 * General sstring splitting function

				 */

				@@ -95,7 +73,7 @@ inline std::vector<sstring> split(const sstring& text, const char* separator) {

				 *

				 */

				template<class T, class F, class V>

				future<json::json_return_type>  sum_stats(distributed<T>& d, V F::*f) {

				future<json::json_return_type>  sum_stats(sharded<T>& d, V F::*f) {

				    return d.map_reduce0([f](const T& p) {return p.get_stats().*f;}, 0,

				            std::plus<V>()).then([](V val) {

				        return make_ready_future<json::json_return_type>(val);

				@@ -128,7 +106,7 @@ httpd::utils_json::rate_moving_average_and_histogram timer_to_json(const utils::

				}

				template<class T, class F>

				future<json::json_return_type>  sum_histogram_stats(distributed<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				future<json::json_return_type>  sum_histogram_stats(sharded<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).hist;}, utils::ihistogram(),

				            std::plus<utils::ihistogram>()).then([](const utils::ihistogram& val) {

				@@ -137,7 +115,7 @@ future<json::json_return_type>  sum_histogram_stats(distributed<T>& d, utils::ti

				}

				template<class T, class F>

				future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				future<json::json_return_type>  sum_timer_stats(sharded<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {

				@@ -146,7 +124,7 @@ future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_

				}

				template<class T, class F>

				future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				future<json::json_return_type>  sum_timer_stats(sharded<T>& d, utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {

				        return make_ready_future<json::json_return_type>(timer_to_json(val));

				@@ -252,67 +230,6 @@ public:

				    operator T() const { return value; }

				};

				using mandatory = bool_class<struct mandatory_tag>;

				class req_params {

				public:

				    struct def {

				        std::optional<sstring> value;

				        mandatory is_mandatory = mandatory::no;

				        def(std::optional<sstring> value_ = std::nullopt, mandatory is_mandatory_ = mandatory::no)

				            : value(std::move(value_))

				            , is_mandatory(is_mandatory_)

				        { }

				        def(mandatory is_mandatory_)

				            : is_mandatory(is_mandatory_)

				        { }

				    };

				private:

				    std::unordered_map<sstring, def> params;

				public:

				    req_params(std::initializer_list<std::pair<sstring, def>> l) {

				        for (const auto& [name, ent] : l) {

				            add(std::move(name), std::move(ent));

				        }

				    }

				    void add(sstring name, def ent) {

				        params.emplace(std::move(name), std::move(ent));

				    }

				    void process(const request& req);

				    const std::optional<sstring>& get(const char* name) const {

				        return params.at(name).value;

				    }

				    template <typename T = sstring>

				    const std::optional<T> get_as(const char* name) const {

				        return get(name);

				    }

				    template <typename T = sstring>

				    requires std::same_as<T, bool>

				    const std::optional<bool> get_as(const char* name) const {

				        auto value = get(name);

				        if (!value) {

				            return std::nullopt;

				        }

				        std::transform(value->begin(), value->end(), value->begin(), ::tolower);

				        if (value == "true" || value == "yes" || value == "1") {

				            return true;

				        }

				        if (value == "false" || value == "no" || value == "0") {

				            return false;

				        }

				        throw boost::bad_lexical_cast{};

				    }

				};

				httpd::utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimated_histogram& val);

				}

									
										30

api/api_init.hh
									
												
				@@ -18,7 +18,9 @@

				using request = http::request;

				using reply = http::reply;

				namespace compaction {

				class compaction_manager;

				}

				namespace service {

				@@ -56,7 +58,6 @@ class sstables_format_selector;

				namespace view {

				class view_builder;

				}

				class system_keyspace;

				}

				namespace netw { class messaging_service; }

				class repair_service;

				@@ -73,22 +74,26 @@ namespace tasks {

				class task_manager;

				}

				namespace cql3 {

				class query_processor;

				}

				namespace api {

				struct http_context {

				    sstring api_dir;

				    sstring api_doc;

				    httpd::http_server_control http_server;

				    distributed<replica::database>& db;

				    sharded<replica::database>& db;

				    http_context(distributed<replica::database>& _db)

				    http_context(sharded<replica::database>& _db)

				            : db(_db)

				    {

				    }

				};

				future<> set_server_init(http_context& ctx);

				future<> set_server_config(http_context& ctx, const db::config& cfg);

				future<> set_server_config(http_context& ctx, db::config& cfg);

				future<> unset_server_config(http_context& ctx);

				future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);

				future<> unset_server_snitch(http_context& ctx);

				@@ -96,7 +101,7 @@ future<> set_server_storage_service(http_context& ctx, sharded<service::storage_

				future<> unset_server_storage_service(http_context& ctx);

				future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);

				future<> unset_server_sstables_loader(http_context& ctx);

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb);

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g);

				future<> unset_server_view_builder(http_context& ctx);

				future<> set_server_repair(http_context& ctx, sharded<repair_service>& repair, sharded<gms::gossip_address_map>& am);

				future<> unset_server_repair(http_context& ctx);

				@@ -108,11 +113,11 @@ future<> set_server_authorization_cache(http_context& ctx, sharded<auth::service

				future<> unset_server_authorization_cache(http_context& ctx);

				future<> set_server_snapshot(http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl);

				future<> unset_server_snapshot(http_context& ctx);

				future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm);

				future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_token_metadata>& tm, sharded<gms::gossiper>& g);

				future<> unset_server_token_metadata(http_context& ctx);

				future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);

				future<> unset_server_gossip(http_context& ctx);

				future<> set_server_column_family(http_context& ctx, sharded<db::system_keyspace>& sys_ks);

				future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db);

				future<> unset_server_column_family(http_context& ctx);

				future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);

				future<> unset_server_messaging_service(http_context& ctx);

				@@ -120,14 +125,12 @@ future<> set_server_storage_proxy(http_context& ctx, sharded<service::storage_pr

				future<> unset_server_storage_proxy(http_context& ctx);

				future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_manager>& sm);

				future<> unset_server_stream_manager(http_context& ctx);

				future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& p);

				future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& p, sharded<gms::gossiper>& g);

				future<> unset_hinted_handoff(http_context& ctx);

				future<> set_server_cache(http_context& ctx);

				future<> unset_server_cache(http_context& ctx);

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction_manager>& cm);

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction::compaction_manager>& cm);

				future<> unset_server_compaction_manager(http_context& ctx);

				future<> set_server_done(http_context& ctx);

				future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg);

				future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg, sharded<gms::gossiper>& gossiper);

				future<> unset_server_task_manager(http_context& ctx);

				future<> set_server_task_manager_test(http_context& ctx, sharded<tasks::task_manager>& tm);

				future<> unset_server_task_manager_test(http_context& ctx);

				@@ -137,10 +140,9 @@ future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);

				future<> unset_server_raft(http_context&);

				future<> set_load_meter(http_context& ctx, service::load_meter& lm);

				future<> unset_load_meter(http_context& ctx);

				future<> set_format_selector(http_context& ctx, db::sstables_format_selector& sel);

				future<> unset_format_selector(http_context& ctx);

				future<> set_server_cql_server_test(http_context& ctx, cql_transport::controller& ctl);

				future<> unset_server_cql_server_test(http_context& ctx);

				future<> set_server_service_levels(http_context& ctx, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp);

				future<> set_server_commitlog(http_context& ctx, sharded<replica::database>&);

				future<> unset_server_commitlog(http_context& ctx);

									
										30

api/cache_service.cc
									
												
				@@ -16,7 +16,7 @@ using namespace json;

				using namespace seastar::httpd;

				namespace cs = httpd::cache_service_json;

				void set_cache_service(http_context& ctx, routes& r) {

				void set_cache_service(http_context& ctx, sharded<replica::database>& db, routes& r) {

				    cs::get_row_cache_save_period_in_seconds.set(r, [](std::unique_ptr<http::request> req) {

				        // We never save the cache

				        // Origin uses 0 for never

				@@ -204,53 +204,53 @@ void set_cache_service(http_context& ctx, routes& r) {

				        });

				    });

				    cs::get_row_hits.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {

				    cs::get_row_hits.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(db, uint64_t(0), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_requests.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {

				    cs::get_row_requests.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(db, uint64_t(0), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_hit_rate.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(ctx, ratio_holder(), [](const replica::column_family& cf) {

				    cs::get_row_hit_rate.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(db, ratio_holder(), [](const replica::column_family& cf) {

				            return ratio_holder(cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count(),

				                    cf.get_row_cache().stats().hits.count());

				        }, std::plus<ratio_holder>());

				    });

				    cs::get_row_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				    cs::get_row_hits_moving_avrage.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(db, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cs::get_row_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				    cs::get_row_requests_moving_avrage.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(db, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate() + cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cs::get_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				    cs::get_row_size.set(r, [&db] (std::unique_ptr<http::request> req) {

				        // In origin row size is the weighted size.

				        // We currently do not support weights, so we use raw size in bytes instead

				        return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {

				        return db.map_reduce0([](replica::database& db) -> uint64_t {

				            return db.row_cache_tracker().region().occupancy().used_space();

				        }, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {

				    cs::get_row_entries.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return db.map_reduce0([](replica::database& db) -> uint64_t {

				            return db.row_cache_tracker().partitions();

				        }, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {

				            return make_ready_future<json::json_return_type>(res);

									
										7

api/cache_service.hh
									
												
				@@ -7,15 +7,20 @@

				 */

				#pragma once

				#include <seastar/core/sharded.hh>

				namespace seastar::httpd {

				class routes;

				}

				namespace replica {

				class database;

				}

				namespace api {

				struct http_context;

				void set_cache_service(http_context& ctx, seastar::httpd::routes& r);

				void set_cache_service(http_context& ctx, seastar::sharded<replica::database>& db, seastar::httpd::routes& r);

				void unset_cache_service(http_context& ctx, seastar::httpd::routes& r);

				}

									
										1

api/collectd.cc
									
												
				@@ -10,7 +10,6 @@

				#include "api/api-doc/collectd.json.hh"

				#include <seastar/core/scollectd.hh>

				#include <seastar/core/scollectd_api.hh>

				#include <boost/range/irange.hpp>

				#include <ranges>

				#include <regex>

				#include "api/api_init.hh"

662

api/column_family.cc


File diff suppressed because it is too large

									
										62

api/column_family.hh
									
												
				@@ -13,25 +13,25 @@

				#include <any>

				#include "api/api_init.hh"

				namespace db {

				class system_keyspace;

				}

				namespace api {

				void set_column_family(http_context& ctx, httpd::routes& r, sharded<db::system_keyspace>& sys_ks);

				void set_column_family(http_context& ctx, httpd::routes& r, sharded<replica::database>& db);

				void unset_column_family(http_context& ctx, httpd::routes& r);

				table_id get_uuid(const sstring& name, const replica::database& db);

				table_info parse_table_info(const sstring& name, const replica::database& db);

				template<class Mapper, class I, class Reducer>

				future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

				future<I> map_reduce_cf_raw(sharded<replica::database>& db, const sstring& name, I init,

				        Mapper mapper, Reducer reducer) {

				    auto uuid = get_uuid(name, ctx.db.local());

				    using mapper_type = std::function<std::unique_ptr<std::any>(replica::database&)>;

				    auto uuid = parse_table_info(name, db.local()).id;

				    using mapper_type = std::function<future<std::unique_ptr<std::any>>(replica::database&)>;

				    using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;

				    return ctx.db.map_reduce0(mapper_type([mapper, uuid](replica::database& db) {

				        return std::make_unique<std::any>(I(mapper(db.find_column_family(uuid))));

				    return db.map_reduce0(mapper_type([mapper, uuid](replica::database& db) {

				        return futurize_invoke([mapper, &db, uuid] {

				            return mapper(db.find_column_family(uuid));

				        }).then([] (auto result) {

				            return std::make_unique<std::any>(I(std::move(result)));

				        });

				    }), std::make_unique<std::any>(std::move(init)), reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {

				        return std::make_unique<std::any>(I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));

				    })).then([] (std::unique_ptr<std::any> r) {

				@@ -41,33 +41,30 @@ future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				future<json::json_return_type> map_reduce_cf(sharded<replica::database>& db, const sstring& name, I init,

				        Mapper mapper, Reducer reducer) {

				    return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([](const I& res) {

				    return map_reduce_cf_raw(db, name, init, mapper, reducer).then([](const I& res) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

				template<class Mapper, class I, class Reducer, class Result>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				future<json::json_return_type> map_reduce_cf(sharded<replica::database>& db, const sstring& name, I init,

				        Mapper mapper, Reducer reducer, Result result) {

				    return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([result](const I& res) mutable {

				    return map_reduce_cf_raw(db, name, init, mapper, reducer).then([result](const I& res) mutable {

				        result = res;

				        return make_ready_future<json::json_return_type>(result);

				    });

				}

				future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const replica::column_family&)> f);

				struct map_reduce_column_families_locally {

				    std::any init;

				    std::function<std::unique_ptr<std::any>(replica::column_family&)> mapper;

				    std::function<future<std::unique_ptr<std::any>>(replica::column_family&)> mapper;

				    std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)> reducer;

				    future<std::unique_ptr<std::any>> operator()(replica::database& db) const {

				        auto res = seastar::make_lw_shared<std::unique_ptr<std::any>>(std::make_unique<std::any>(init));

				        return db.get_tables_metadata().for_each_table_gently([res, this] (table_id, seastar::lw_shared_ptr<replica::table> table) {

				            *res = reducer(std::move(*res), mapper(*table.get()));

				            return make_ready_future();

				        return db.get_tables_metadata().for_each_table_gently([res, this] (table_id, seastar::lw_shared_ptr<replica::table> table) -> future<> {

				            *res = reducer(std::move(*res), co_await mapper(*table.get()));

				        }).then([res] () {

				            return std::move(*res);

				        });

				@@ -75,17 +72,21 @@ struct map_reduce_column_families_locally {

				};

				template<class Mapper, class I, class Reducer>

				future<I> map_reduce_cf_raw(http_context& ctx, I init,

				future<I> map_reduce_cf_raw(sharded<replica::database>& db, I init,

				        Mapper mapper, Reducer reducer) {

				    using mapper_type = std::function<std::unique_ptr<std::any>(replica::column_family&)>;

				    using mapper_type = std::function<future<std::unique_ptr<std::any>>(replica::column_family&)>;

				    using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;

				    auto wrapped_mapper = mapper_type([mapper = std::move(mapper)] (replica::column_family& cf) mutable {

				        return std::make_unique<std::any>(I(mapper(cf)));

				        return futurize_invoke([&cf, mapper] {

				            return mapper(cf);

				        }).then([] (auto result) {

				            return std::make_unique<std::any>(I(std::move(result)));

				        });

				    });

				    auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {

				        return std::make_unique<std::any>(I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));

				    });

				    return ctx.db.map_reduce0(map_reduce_column_families_locally{init,

				    return db.map_reduce0(map_reduce_column_families_locally{init,

				            std::move(wrapped_mapper), wrapped_reducer}, std::make_unique<std::any>(init), wrapped_reducer).then([] (std::unique_ptr<std::any> res) {

				        return std::any_cast<I>(std::move(*res));

				    });

				@@ -93,20 +94,13 @@ future<I> map_reduce_cf_raw(http_context& ctx, I init,

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, I init,

				future<json::json_return_type> map_reduce_cf(sharded<replica::database>& db, I init,

				        Mapper mapper, Reducer reducer) {

				    return map_reduce_cf_raw(ctx, init, mapper, reducer).then([](const I& res) {

				    return map_reduce_cf_raw(db, init, mapper, reducer).then([](const I& res) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

				future<json::json_return_type>  get_cf_stats(http_context& ctx, const sstring& name,

				        int64_t replica::column_family_stats::*f);

				future<json::json_return_type>  get_cf_stats(http_context& ctx,

				        int64_t replica::column_family_stats::*f);

				std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);

				}

									
										71

api/compaction_manager.cc
									
												
				@@ -14,6 +14,7 @@

				#include "api/api.hh"

				#include "api/api-doc/compaction_manager.json.hh"

				#include "api/api-doc/storage_service.json.hh"

				#include "db/compaction_history_entry.hh"

				#include "db/system_keyspace.hh"

				#include "column_family.hh"

				#include "unimplemented.hh"

				@@ -28,9 +29,9 @@ namespace ss = httpd::storage_service_json;

				using namespace json;

				using namespace seastar::httpd;

				static future<json::json_return_type> get_cm_stats(sharded<compaction_manager>& cm,

				        int64_t compaction_manager::stats::*f) {

				    return cm.map_reduce0([f](compaction_manager& cm) {

				static future<json::json_return_type> get_cm_stats(sharded<compaction::compaction_manager>& cm,

				        int64_t compaction::compaction_manager::stats::*f) {

				    return cm.map_reduce0([f](compaction::compaction_manager& cm) {

				        return cm.get_stats().*f;

				    }, int64_t(0), std::plus<int64_t>()).then([](const int64_t& res) {

				        return make_ready_future<json::json_return_type>(res);

				@@ -46,9 +47,9 @@ static std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_ha

				    return std::move(a);

				}

				void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_manager>& cm) {

				void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction::compaction_manager>& cm) {

				    cm::get_compactions.set(r, [&cm] (std::unique_ptr<http::request> req) {

				-        return cm.map_reduce0([](compaction_manager& cm) {

				+        return cm.map_reduce0([](compaction::compaction_manager& cm) {

				            std::vector<cm::summary> summaries;

				            for (const auto& c : cm.get_compactions()) {

				@@ -57,7 +58,7 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				                s.ks = c.ks_name;

				                s.cf = c.cf_name;

				                s.unit = "keys";

				-                s.task_type = sstables::compaction_name(c.type);

				+                s.task_type = compaction::compaction_name(c.type);

				                s.completed = c.total_keys_written;

				                s.total = c.total_partitions;

				                summaries.push_back(std::move(s));

				@@ -71,10 +72,9 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    cm::get_pending_tasks_by_table.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return ctx.db.map_reduce0([](replica::database& db) {

				            return do_with(std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>(), [&db](std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_hash>& tasks) {

				-                return db.get_tables_metadata().for_each_table_gently([&tasks] (table_id, lw_shared_ptr<replica::table> table) {

				+                return db.get_tables_metadata().for_each_table_gently([&tasks] (table_id, lw_shared_ptr<replica::table> table) -> future<> {

				                    replica::table& cf = *table.get();

				-                    tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = cf.estimate_pending_compactions();

				-                    return make_ready_future<>();

				+                    tasks[std::make_pair(cf.schema()->ks_name(), cf.schema()->cf_name())] = co_await cf.estimate_pending_compactions();

				                }).then([&tasks] {

				                    return std::move(tasks);

				                });

				@@ -103,23 +103,20 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    cm::stop_compaction.set(r, [&cm] (std::unique_ptr<http::request> req) {

				        auto type = req->get_query_param("type");

				-        return cm.invoke_on_all([type] (compaction_manager& cm) {

				+        return cm.invoke_on_all([type] (compaction::compaction_manager& cm) {

				            return cm.stop_compaction(type);

				        }).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				-    cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				-        auto ks_name = validate_keyspace(ctx, req);

				-        auto table_names = parse_tables(ks_name, ctx, req->query_parameters, "tables");

				+    cm::stop_keyspace_compaction.set(r, [&ctx, &cm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				+        auto [ks_name, tables] = parse_table_infos(ctx, *req, "tables");

				        auto type = req->get_query_param("type");

				-        co_await ctx.db.invoke_on_all([&] (replica::database& db) {

				-            auto& cm = db.get_compaction_manager();

				-            return parallel_for_each(table_names, [&] (sstring& table_name) {

				-                auto& t = db.find_column_family(ks_name, table_name);

				-                return t.parallel_foreach_table_state([&] (compaction::table_state& ts) {

				-                    return cm.stop_compaction(type, &ts);

				+        co_await cm.invoke_on_all([&] (compaction::compaction_manager& cm) {

				+            return parallel_for_each(tables, [&] (const table_info& ti) {

				+                return cm.stop_compaction(type, [id = ti.id] (const compaction::compaction_group_view* x) {

				+                    return x->schema()->id() == id;

				                });

				            });

				        });

				@@ -127,13 +124,13 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    });

				    cm::get_pending_tasks.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				-        return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {

				+        return map_reduce_cf(ctx.db, int64_t(0), [](replica::column_family& cf) {

				            return cf.estimate_pending_compactions();

				        }, std::plus<int64_t>());

				    });

				    cm::get_completed_tasks.set(r, [&cm] (std::unique_ptr<http::request> req) {

				-        return get_cm_stats(cm, &compaction_manager::stats::completed_tasks);

				+        return get_cm_stats(cm, &compaction::compaction_manager::stats::completed_tasks);

				    });

				    cm::get_total_compactions_completed.set(r, [] (std::unique_ptr<http::request> req) {

				@@ -151,7 +148,7 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    });

				    cm::get_compaction_history.set(r, [&cm] (std::unique_ptr<http::request> req) {

				-        std::function<future<>(output_stream<char>&&)> f = [&cm] (output_stream<char>&& out) -> future<> {

				+        noncopyable_function<future<>(output_stream<char>&&)> f = [&cm] (output_stream<char>&& out) -> future<> {

				            auto s = std::move(out);

				            bool first = true;

				            std::exception_ptr ex;

				@@ -160,8 +157,11 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				                co_await cm.local().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {

				                        cm::history h;

				                        h.id = fmt::to_string(entry.id);

				+                        h.shard_id = entry.shard_id;

				                        h.ks = std::move(entry.ks);

				                        h.cf = std::move(entry.cf);

				+                        h.compaction_type = entry.compaction_type;

				+                        h.started_at = entry.started_at;

				                        h.compacted_at = entry.compacted_at;

				                        h.bytes_in = entry.bytes_in;

				                        h.bytes_out =  entry.bytes_out;

				@@ -173,6 +173,24 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				                            e.value = it.second;

				                            h.rows_merged.push(std::move(e));

				                        }

				+                        for (const auto& data : entry.sstables_in) {

				+                            httpd::compaction_manager_json::sstableinfo sstable;

				+                            sstable.generation = fmt::to_string(data.generation),

				+                            sstable.origin = data.origin,

				+                            sstable.size = data.size,

				+                            h.sstables_in.push(std::move(sstable));

				+                        }

				+                        for (const auto& data : entry.sstables_out) {

				+                            httpd::compaction_manager_json::sstableinfo sstable;

				+                            sstable.generation = fmt::to_string(data.generation),

				+                            sstable.origin = data.origin,

				+                            sstable.size = data.size,

				+                            h.sstables_out.push(std::move(sstable));

				+                        }

				+                        h.total_tombstone_purge_attempt = entry.total_tombstone_purge_attempt;

				+                        h.total_tombstone_purge_failure_due_to_overlapping_with_memtable = entry.total_tombstone_purge_failure_due_to_overlapping_with_memtable;

				+                        h.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable = entry.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable;

				                        if (!first) {

				                            co_await s.write(", ");

				                        }

				@@ -204,14 +222,6 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				        int value = cm.local().throughput_mbs();

				        return make_ready_future<json::json_return_type>(value);

				    });

				-    ss::set_compaction_throughput_mb_per_sec.set(r, [](std::unique_ptr<http::request> req) {

				-        //TBD

				-        unimplemented();

				-        auto value = req->get_query_param("value");

				-        return make_ready_future<json::json_return_type>(json_void());

				-    });

				}

				void unset_compaction_manager(http_context& ctx, routes& r) {

				@@ -227,7 +237,6 @@ void unset_compaction_manager(http_context& ctx, routes& r) {

				    cm::get_compaction_history.unset(r);

				    cm::get_compaction_info.unset(r);

				    ss::get_compaction_throughput_mb_per_sec.unset(r);

				-    ss::set_compaction_throughput_mb_per_sec.unset(r);

				}

				}

									
api/compaction_manager.hh (4 lines changed)
												
				@@ -13,11 +13,13 @@ namespace seastar::httpd {

				class routes;

				}

				+namespace compaction {

				class compaction_manager;

				+}

				namespace api {

				struct http_context;

				-void set_compaction_manager(http_context& ctx, seastar::httpd::routes& r, seastar::sharded<compaction_manager>& cm);

				+void set_compaction_manager(http_context& ctx, seastar::httpd::routes& r, seastar::sharded<compaction::compaction_manager>& cm);

				void unset_compaction_manager(http_context& ctx, seastar::httpd::routes& r);

				}

									
api/config.cc (34 lines changed)
												
				@@ -14,6 +14,7 @@

				#include "replica/database.hh"

				#include "db/config.hh"

				#include <sstream>

				#include <fmt/ranges.h>

				#include <boost/algorithm/string/replace.hpp>

				#include <seastar/http/exception.hh>

				@@ -22,22 +23,6 @@ using namespace seastar::httpd;

				namespace sp = httpd::storage_proxy_json;

				namespace ss = httpd::storage_service_json;

				-template<class T>

				-json::json_return_type get_json_return_type(const T& val) {

				-    return json::json_return_type(val);

				-}

				-/*

				- * As commented on db::seed_provider_type is not used

				- * and probably never will.

				- *

				- * Just in case, we will return its name

				- */

				-template<>

				-json::json_return_type get_json_return_type(const db::seed_provider_type& val) {

				-    return json::json_return_type(val.class_name);

				-}

				std::string_view format_type(std::string_view type) {

				    if (type == "int") {

				        return "integer";

				@@ -83,7 +68,7 @@ future<> get_config_swagger_entry(std::string_view name, const std::string& desc

				namespace cs = httpd::config_json;

				-void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, const db::config& cfg, bool first) {

				+void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx, routes& r, db::config& cfg, bool first) {

				    rb->register_function(r, [&cfg, first] (output_stream<char>& os) {

				        return do_with(first, [&os, &cfg] (bool& first) {

				            auto f = make_ready_future();

				@@ -186,13 +171,24 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx

				    });

				    ss::get_all_data_file_locations.set(r, [&cfg](const_req req) {

				-        return container_to_vec(cfg.data_file_directories());

				+        return cfg.data_file_directories();

				    });

				    ss::get_saved_caches_location.set(r, [&cfg](const_req req) {

				        return cfg.saved_caches_directory();

				    });

				+    ss::set_compaction_throughput_mb_per_sec.set(r, [&cfg](std::unique_ptr<http::request> req) mutable {

				+        api::req_param<uint32_t> value(*req, "value", 0);

				+        cfg.compaction_throughput_mb_per_sec(value.value, utils::config_file::config_source::API);

				+        return make_ready_future<json::json_return_type>(json::json_void());

				+    });

				+    ss::set_stream_throughput_mb_per_sec.set(r, [&cfg](std::unique_ptr<http::request> req) mutable {

				+        api::req_param<uint32_t> value(*req, "value", 0);

				+        cfg.stream_io_throughput_mb_per_sec(value.value, utils::config_file::config_source::API);

				+        return make_ready_future<json::json_return_type>(json::json_void());

				+    });

				}

				void unset_config(http_context& ctx, routes& r) {

				@@ -213,6 +209,8 @@ void unset_config(http_context& ctx, routes& r) {

				    sp::set_truncate_rpc_timeout.unset(r);

				    ss::get_all_data_file_locations.unset(r);

				    ss::get_saved_caches_location.unset(r);

				+    ss::set_compaction_throughput_mb_per_sec.unset(r);

				+    ss::set_stream_throughput_mb_per_sec.unset(r);

				}

				}

									
api/config.hh (2 lines changed)
												
				@@ -13,6 +13,6 @@

				namespace api {

				-void set_config(std::shared_ptr<httpd::api_registry_builder20> rb, http_context& ctx, httpd::routes& r, const db::config& cfg, bool first = false);

				+void set_config(std::shared_ptr<httpd::api_registry_builder20> rb, http_context& ctx, httpd::routes& r, db::config& cfg, bool first = false);

				void unset_config(http_context& ctx, httpd::routes& r);

				}

									
api/cql_server_test.cc (12 lines changed)
												
				@@ -6,6 +6,8 @@

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include "build_mode.hh"

				#ifndef SCYLLA_BUILD_MODE_RELEASE

				#include <seastar/core/coroutine.hh>

				@@ -26,21 +28,24 @@ struct connection_sl_params : public json::json_base {

				    json::json_element<sstring> _role_name;

				    json::json_element<sstring> _workload_type;

				    json::json_element<sstring> _timeout;

				+    json::json_element<sstring> _scheduling_group;

				-    connection_sl_params(const sstring& role_name, const sstring& workload_type, const sstring& timeout) {

				+    connection_sl_params(const sstring& role_name, const sstring& workload_type, const sstring& timeout, const sstring& scheduling_group) {

				        _role_name = role_name;

				        _workload_type = workload_type;

				        _timeout = timeout;

				+        _scheduling_group = scheduling_group;

				        register_params();

				    }

				    connection_sl_params(const connection_sl_params& params)

				-        : connection_sl_params(params._role_name(), params._workload_type(), params._timeout()) {}

				+        : connection_sl_params(params._role_name(), params._workload_type(), params._timeout(), params._scheduling_group()) {}

				    void register_params() {

				        add(&_role_name, "role_name");

				        add(&_workload_type, "workload_type");

				        add(&_timeout, "timeout");

				+        add(&_scheduling_group, "scheduling_group");

				    }    

				};

				@@ -54,7 +59,8 @@ void set_cql_server_test(http_context& ctx, seastar::httpd::routes& r, cql_trans

				            return connection_sl_params(

				                    std::move(params.role_name), 

				                    sstring(qos::service_level_options::to_string(params.workload_type)), 

				-                    to_string(cql_duration(months_counter{0}, days_counter{0}, nanoseconds_counter{nanos})));

				+                    to_string(cql_duration(months_counter{0}, days_counter{0}, nanoseconds_counter{nanos})),

				+                    std::move(params.scheduling_group_name));

				        });

				        co_return result;

				    });

									
api/error_injection.cc (9 lines changed)
												
				@@ -21,10 +21,10 @@ namespace hf = httpd::error_injection_json;

				void set_error_injection(http_context& ctx, routes& r) {

				-    hf::enable_injection.set(r, [](std::unique_ptr<request> req) {

				+    hf::enable_injection.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {

				        sstring injection = req->get_path_param("injection");

				        bool one_shot = req->get_query_param("one_shot") == "True";

				-        auto params = req->content;

				+        auto params = co_await util::read_entire_stream_contiguous(*req->content_stream);

				        const size_t max_params_size = 1024 * 1024;

				        if (params.size() > max_params_size) {

				@@ -39,12 +39,11 @@ void set_error_injection(http_context& ctx, routes& r) {

				                : rjson::parse_to_map<utils::error_injection_parameters>(params);

				            auto& errinj = utils::get_local_injector();

				-            return errinj.enable_on_all(injection, one_shot, std::move(parameters)).then([] {

				-                return make_ready_future<json::json_return_type>(json::json_void());

				-            });

				+            co_await errinj.enable_on_all(injection, one_shot, std::move(parameters));

				        } catch (const rjson::error& e) {

				            throw httpd::bad_param_exception(format("Failed to parse injections parameters: {}", e.what()));

				        }

				+        co_return json::json_void();

				    });

				    hf::get_enabled_injections_on_all.set(r, [](std::unique_ptr<request> req) {

									
api/failure_detector.cc (24 lines changed)
												
				@@ -22,10 +22,10 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				            std::vector<fd::endpoint_state> res;

				            res.reserve(g.num_endpoints());

				-            g.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& eps) {

				+            g.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {

				                fd::endpoint_state val;

				-                val.addrs = fmt::to_string(addr);

				-                val.is_alive = g.is_alive(addr);

				+                val.addrs = fmt::to_string(eps.get_ip());

				+                val.is_alive = g.is_alive(eps.get_host_id());

				                val.generation = eps.get_heart_beat_state().get_generation().value();

				                val.version = eps.get_heart_beat_state().get_heart_beat_version().value();

				                val.update_time = eps.get_update_timestamp().time_since_epoch().count();

				@@ -40,7 +40,9 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				                }

				                res.emplace_back(std::move(val));

				            });

				-            return make_ready_future<json::json_return_type>(res);

				+            return make_ready_future<json::json_return_type>(json::stream_range_as_array(res, [](const fd::endpoint_state& i){

				+                return i;

				+            }));

				        });

				    });

				@@ -64,11 +66,15 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				    fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				-            std::map<sstring, sstring> nodes_status;

				-            g.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state&) {

				-                nodes_status.emplace(fmt::to_string(node), g.is_alive(node) ? "UP" : "DOWN");

				+            std::vector<fd::mapper> nodes_status;

				+            nodes_status.reserve(g.num_endpoints());

				+            g.for_each_endpoint_state([&] (const gms::endpoint_state& es) {

				+                fd::mapper val;

				+                val.key = fmt::to_string(es.get_ip());

				+                val.value = g.is_alive(es.get_host_id()) ? "UP" : "DOWN";

				+                nodes_status.emplace_back(std::move(val));

				            });

				-            return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));

				+            return make_ready_future<json::json_return_type>(std::move(nodes_status));

				        });

				    });

				@@ -81,7 +87,7 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				    fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {

				        return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {

				-            auto state = g.get_endpoint_state_ptr(gms::inet_address(req->get_path_param("addr")));

				+            auto state = g.get_endpoint_state_ptr(g.get_host_id(gms::inet_address(req->get_path_param("addr"))));

				            if (!state) {

				                return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->get_path_param("addr")));

				            }

									
api/gossiper.cc (24 lines changed)
												
				@@ -21,51 +21,45 @@ using namespace json;

				void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {

				    httpd::gossiper_json::get_down_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto res = co_await g.get_unreachable_members_synchronized();

				-        co_return json::json_return_type(container_to_vec(res));

				+        co_return json::json_return_type(res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>());

				    });

				-    httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) {

				-        return g.get_live_members_synchronized().then([] (auto res) {

				-            return make_ready_future<json::json_return_type>(container_to_vec(res));

				-        });

				+    httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				+        auto res = co_await g.get_live_members_synchronized();

				+        co_return json::json_return_type(res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>());

				    });

				    httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        gms::inet_address ep(req->get_path_param("addr"));

				        // synchronize unreachable_members on all shards

				        co_await g.get_unreachable_members_synchronized();

				-        co_return g.get_endpoint_downtime(ep);

				+        co_return g.get_endpoint_downtime(g.get_host_id(ep));

				    });

				    httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<http::request> req) {

				        gms::inet_address ep(req->get_path_param("addr"));

				-        return g.get_current_generation_number(ep).then([] (gms::generation_type res) {

				+        return g.get_current_generation_number(g.get_host_id(ep)).then([] (gms::generation_type res) {

				            return make_ready_future<json::json_return_type>(res.value());

				        });

				    });

				    httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<http::request> req) {

				        gms::inet_address ep(req->get_path_param("addr"));

				-        return g.get_current_heart_beat_version(ep).then([] (gms::version_type res) {

				+        return g.get_current_heart_beat_version(g.get_host_id(ep)).then([] (gms::version_type res) {

				            return make_ready_future<json::json_return_type>(res.value());

				        });

				    });

				    httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {

				-        if (req->get_query_param("unsafe") != "True") {

				-            return g.assassinate_endpoint(req->get_path_param("addr")).then([] {

				-                return make_ready_future<json::json_return_type>(json_void());

				-            });

				-        }

				-        return g.unsafe_assassinate_endpoint(req->get_path_param("addr")).then([] {

				+        return g.assassinate_endpoint(req->get_path_param("addr")).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {

				        gms::inet_address ep(req->get_path_param("addr"));

				-        return g.force_remove_endpoint(ep, gms::null_permit_id).then([] {

				+        return g.force_remove_endpoint(g.get_host_id(ep), gms::null_permit_id).then([] () {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

									
api/hinted_handoff.cc (13 lines changed)
												
				@@ -14,6 +14,7 @@

				#include "gms/inet_address.hh"

				#include "service/storage_proxy.hh"

				+#include "gms/gossiper.hh"

				namespace api {

				@@ -21,18 +22,18 @@ using namespace json;

				using namespace seastar::httpd;

				namespace hh = httpd::hinted_handoff_json;

				-void set_hinted_handoff(http_context& ctx, routes& r, sharded<service::storage_proxy>& proxy) {

				-    hh::create_hints_sync_point.set(r, [&proxy] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				-        auto parse_hosts_list = [] (sstring arg) {

				+void set_hinted_handoff(http_context& ctx, routes& r, sharded<service::storage_proxy>& proxy, sharded<gms::gossiper>& g) {

				+    hh::create_hints_sync_point.set(r, [&proxy, &g] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				+        auto parse_hosts_list = [&g] (sstring arg) {

				            std::vector<sstring> hosts_str = split(arg, ",");

				-            std::vector<gms::inet_address> hosts;

				+            std::vector<locator::host_id> hosts;

				            hosts.reserve(hosts_str.size());

				            for (const auto& host_str : hosts_str) {

				                try {

				                    gms::inet_address host;

				                    host = gms::inet_address(host_str);

				-                    hosts.push_back(host);

				+                    hosts.push_back(g.local().get_host_id(host));

				                } catch (std::exception& e) {

				                    throw httpd::bad_param_exception(format("Failed to parse host address {}: {}", host_str, e.what()));

				                }

				@@ -41,7 +42,7 @@ void set_hinted_handoff(http_context& ctx, routes& r, sharded<service::storage_p

				            return hosts;

				        };

				-        std::vector<gms::inet_address> target_hosts = parse_hosts_list(req->get_query_param("target_hosts"));

				+        std::vector<locator::host_id> target_hosts = parse_hosts_list(req->get_query_param("target_hosts"));

				        return proxy.local().create_hint_sync_point(std::move(target_hosts)).then([] (db::hints::sync_point sync_point) {

				            return json::json_return_type(sync_point.encode());

				        });

									
api/hinted_handoff.hh (3 lines changed)
												
				@@ -10,12 +10,13 @@

				#include <seastar/core/sharded.hh>

				#include "api/api_init.hh"

				+#include "gms/gossiper.hh"

				namespace service { class storage_proxy; }

				namespace api {

				-void set_hinted_handoff(http_context& ctx, httpd::routes& r, sharded<service::storage_proxy>& p);

				+void set_hinted_handoff(http_context& ctx, httpd::routes& r, sharded<service::storage_proxy>& p, sharded<gms::gossiper>& g);

				void unset_hinted_handoff(http_context& ctx, httpd::routes& r);

				}

									
api/messaging_service.cc (4 lines changed)
												
				@@ -114,7 +114,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging

				    }));

				    get_version.set(r, [&ms](const_req req) {

				-        return ms.local().get_raw_version(gms::inet_address(req.get_query_param("addr")));

				+        return ms.local().current_version;

				    });

				    get_dropped_messages_by_ver.set(r, [&ms](std::unique_ptr<request> req) {

				@@ -148,7 +148,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging

				    hf::inject_disconnect.set(r, [&ms] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto ip = msg_addr(req->get_path_param("ip"));

				        co_await ms.invoke_on_all([ip] (netw::messaging_service& ms) {

				-            ms.remove_rpc_client(ip);

				+            ms.remove_rpc_client(ip, std::nullopt);

				        });

				        co_return json::json_void();

				    });

									
api/raft.cc (6 lines changed)
												
				@@ -71,7 +71,7 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis

				        co_return json_void{};

				    });

				    r::get_leader_host.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {

				-        if (!req->query_parameters.contains("group_id")) {

				+        if (req->get_query_param("group_id").empty()) {

				            const auto leader_id = co_await raft_gr.invoke_on(0, [] (service::raft_group_registry& raft_gr) {

				                auto& srv = raft_gr.group0();

				                return srv.current_leader();

				@@ -100,7 +100,7 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis

				    r::read_barrier.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {

				        auto timeout = get_request_timeout(*req);

				-        if (!req->query_parameters.contains("group_id")) {

				+        if (req->get_query_param("group_id").empty()) {

				            // Read barrier on group 0 by default

				            co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) -> future<> {

				                co_await raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);

				@@ -131,7 +131,7 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis

				        const auto stepdown_timeout_ticks = dur / service::raft_tick_interval;

				        auto timeout_dur = raft::logical_clock::duration(stepdown_timeout_ticks);

				-        if (!req->query_parameters.contains("group_id")) {

				+        if (req->get_query_param("group_id").empty()) {

				            // Stepdown on group 0 by default

				            co_await raft_gr.invoke_on(0, [timeout_dur] (service::raft_group_registry& raft_gr) {

				                apilog.info("Triggering stepdown for group0");

									
api/service_levels.cc (new file, 63 lines added)
												
				@@ -0,0 +1,63 @@

				/*

				 * Copyright (C) 2023-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include "service_levels.hh"

				#include "api/api-doc/service_levels.json.hh"

				#include "cql3/query_processor.hh"

				#include "cql3/untyped_result_set.hh"

				#include "db/consistency_level_type.hh"

				#include <seastar/json/json_elements.hh>

				#include "transport/controller.hh"

				#include <unordered_map>

				namespace api {

				namespace sl = httpd::service_levels_json;

				using namespace json;

				using namespace seastar::httpd;

				void set_service_levels(http_context& ctx, routes& r, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp) {

				    sl::do_switch_tenants.set(r, [&ctl] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        co_await ctl.update_connections_scheduling_group();

				        co_return json_void();

				    });

				    sl::count_connections.set(r, [&qp] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto connections = co_await qp.local().execute_internal(

				            "SELECT username, scheduling_group FROM system.clients WHERE client_type='cql' ALLOW FILTERING",

				            db::consistency_level::LOCAL_ONE,

				            cql3::query_processor::cache_internal::no

				        );

				        using connections_per_user = std::unordered_map<sstring, uint64_t>;

				        using connections_per_scheduling_group = std::unordered_map<sstring, connections_per_user>;

				        connections_per_scheduling_group result;

				        for (auto it = connections->begin(); it != connections->end(); it++) {

				            auto user = it->get_as<sstring>("username");

				            auto shg = it->get_as<sstring>("scheduling_group");

				            if (result.contains(shg)) {

				                result[shg][user]++;

				            }

				            else {

				                result[shg] = {{user, 1}};

				            }

				        }

				        co_return result;

				    });

				}

				}
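The aggregation done by `count_connections` above can be sketched in isolation. This is a minimal standalone sketch, not the ScyllaDB code: `client_row` and `count_by_group` are hypothetical names, and `std::string` stands in for `sstring`. It relies on `operator[]` value-initializing missing entries, so it does not need the explicit `contains` branch.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one row of system.clients.
struct client_row {
    std::string username;
    std::string scheduling_group;
};

using connections_per_user = std::unordered_map<std::string, uint64_t>;
using connections_per_scheduling_group = std::unordered_map<std::string, connections_per_user>;

// Counts connections per (scheduling group, user) pair.
// operator[] default-constructs a missing inner map and value-initializes
// a missing counter to 0, so a single increment covers both cases.
connections_per_scheduling_group count_by_group(const std::vector<client_row>& rows) {
    connections_per_scheduling_group result;
    for (const auto& row : rows) {
        ++result[row.scheduling_group][row.username];
    }
    return result;
}
```

Whether the shorter form is preferable in the handler itself is a style choice; the explicit branch in the diff behaves the same way.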

									
										17

api/service_levels.hh
									
										Normal file
									
												
				@@ -0,0 +1,17 @@

				/*

				 * Copyright (C) 2023-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#pragma once

				#include "api/api_init.hh"

				namespace api {

				void set_service_levels(http_context& ctx, httpd::routes& r, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp);

				}

									
										16

api/storage_proxy.cc
									
												
				@@ -39,7 +39,7 @@ utils::time_estimated_histogram timed_rate_moving_average_summary_merge(utils::t

				 * @return A future that resolves to the result of the aggregation.

				 */

				template<typename V, typename Reducer, typename InnerMapper>

				future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				future<V> two_dimensional_map_reduce(sharded<service::storage_proxy>& d,

				        InnerMapper mapper, Reducer reducer, V initial_value) {

				    return d.map_reduce0( [mapper, reducer, initial_value] (const service::storage_proxy& sp) {

				        return map_reduce_scheduling_group_specific<service::storage_proxy_stats::stats>(

				@@ -59,7 +59,7 @@ future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				 * @return A future that resolves to the result of the aggregation.

				 */

				template<typename V, typename Reducer, typename F, typename C>

				future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				future<V> two_dimensional_map_reduce(sharded<service::storage_proxy>& d,

				        C F::*f, Reducer reducer, V initial_value) {

				    return two_dimensional_map_reduce(d, [f] (F& stats) -> V {

				        return stats.*f;

				@@ -75,20 +75,20 @@ future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				 *

				 */

				template<typename V, typename F>

				future<json::json_return_type>  sum_stats_storage_proxy(distributed<proxy>& d, V F::*f) {

				future<json::json_return_type>  sum_stats_storage_proxy(sharded<proxy>& d, V F::*f) {

				    return two_dimensional_map_reduce(d, [f] (F& stats) { return stats.*f; }, std::plus<V>(), V(0)).then([] (V val) {

				        return make_ready_future<json::json_return_type>(val);

				    });

				}

				static future<utils::rate_moving_average>  sum_timed_rate(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				static future<utils::rate_moving_average>  sum_timed_rate(sharded<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

				        return (stats.*f).rate();

				    }, std::plus<utils::rate_moving_average>(), utils::rate_moving_average());

				}

				static future<json::json_return_type>  sum_timed_rate_as_obj(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				static future<json::json_return_type>  sum_timed_rate_as_obj(sharded<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				    return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {

				        httpd::utils_json::rate_moving_average m;

				        m = val;

				@@ -100,7 +100,7 @@ httpd::utils_json::rate_moving_average_and_histogram get_empty_moving_average()

				    return timer_to_json(utils::rate_moving_average_and_histogram());

				}

				static future<json::json_return_type>  sum_timed_rate_as_long(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				static future<json::json_return_type>  sum_timed_rate_as_long(sharded<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				    return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {

				        return make_ready_future<json::json_return_type>(val.count);

				    });

				@@ -152,7 +152,7 @@ static future<json::json_return_type>  total_latency(sharded<service::storage_pr

				 */

				template<typename F>

				future<json::json_return_type>

				sum_histogram_stats_storage_proxy(distributed<proxy>& d,

				sum_histogram_stats_storage_proxy(sharded<proxy>& d,

				        utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

				        return (stats.*f).hist;

				@@ -172,7 +172,7 @@ sum_histogram_stats_storage_proxy(distributed<proxy>& d,

				 */

				template<typename F>

				future<json::json_return_type>

				sum_timer_stats_storage_proxy(distributed<proxy>& d,

				sum_timer_stats_storage_proxy(sharded<proxy>& d,

				        utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

1244

api/storage_service.cc

File diff suppressed because it is too large

									
										32

api/storage_service.hh
									
												
				@@ -43,39 +43,34 @@ sstring validate_keyspace(const http_context& ctx, sstring ks_name);

				// containing the description of the respective keyspace error.

				sstring validate_keyspace(const http_context& ctx, const std::unique_ptr<http::request>& req);

				// verify that the table parameter is found, otherwise a bad_param_exception exception is thrown

				// containing the description of the respective table error.

				void validate_table(const http_context& ctx, sstring ks_name, sstring table_name);

				// splits a request parameter assumed to hold a comma-separated list of table names

				// verify that the tables are found, otherwise a bad_param_exception exception is thrown

				// containing the description of the respective no_such_column_family error.

				// Returns an empty vector if no parameter was found.

				// If the parameter is found and empty, returns a list of all table names in the keyspace.

				std::vector<sstring> parse_tables(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);

				// verify that the keyspace:table is found, otherwise a bad_param_exception exception is thrown

				// returns the table_id of the table if found

				table_id validate_table(const replica::database& db, sstring ks_name, sstring table_name);

				// splits a request parameter assumed to hold a comma-separated list of table names

				// verify that the tables are found, otherwise a bad_param_exception exception is thrown

				// containing the description of the respective no_such_column_family error.

				// Returns a vector of all table infos given by the parameter, or

				// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.

				std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);

				std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, sstring value);

				std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name = "cf");

				struct scrub_info {

				    sstables::compaction_type_options::scrub opts;

				    compaction::compaction_type_options::scrub opts;

				    sstring keyspace;

				    std::vector<sstring> column_families;

				    sstring snapshot_tag;

				};

				future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl, std::unique_ptr<http::request> req);

				scrub_info parse_scrub_options(const http_context& ctx, std::unique_ptr<http::request> req);

				void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, service::raft_group0_client&);

				void unset_storage_service(http_context& ctx, httpd::routes& r);

				void set_sstables_loader(http_context& ctx, httpd::routes& r, sharded<sstables_loader>& sst_loader);

				void unset_sstables_loader(http_context& ctx, httpd::routes& r);

				void set_view_builder(http_context& ctx, httpd::routes& r, sharded<db::view::view_builder>& vb);

				void set_view_builder(http_context& ctx, httpd::routes& r, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g);

				void unset_view_builder(http_context& ctx, httpd::routes& r);

				void set_repair(http_context& ctx, httpd::routes& r, sharded<repair_service>& repair, sharded<gms::gossip_address_map>& am);

				void unset_repair(http_context& ctx, httpd::routes& r);

				@@ -87,6 +82,13 @@ void set_snapshot(http_context& ctx, httpd::routes& r, sharded<db::snapshot_ctl>

				void unset_snapshot(http_context& ctx, httpd::routes& r);

				void set_load_meter(http_context& ctx, httpd::routes& r, service::load_meter& lm);

				void unset_load_meter(http_context& ctx, httpd::routes& r);

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, bool legacy_request = false);

				// converts string value of boolean parameter into bool

				// maps (case insensitively)

				//     "true", "yes" and "1" into true

				//     "false", "no" and "0" into false

				// otherwise throws runtime_error

				bool validate_bool_x(const sstring& param, bool default_value);

				} // namespace api
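The mapping documented for `validate_bool_x` above could look roughly like the following. This is a hedged sketch under stated assumptions, not the ScyllaDB implementation: `parse_bool_param` is a hypothetical name, `std::string` replaces `sstring`, and returning `default_value` for an empty parameter is an assumption, since the comment does not say when the default applies.

```cpp
#include <algorithm>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical helper mirroring the documented mapping; not ScyllaDB's code.
// Assumption: an empty parameter yields default_value.
bool parse_bool_param(const std::string& param, bool default_value) {
    if (param.empty()) {
        return default_value;
    }
    // Lower-case a copy so the comparison is case-insensitive.
    std::string lowered(param.size(), '\0');
    std::transform(param.begin(), param.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lowered == "true" || lowered == "yes" || lowered == "1") {
        return true;
    }
    if (lowered == "false" || lowered == "no" || lowered == "0") {
        return false;
    }
    // Anything else is rejected, as the header comment describes.
    throw std::runtime_error("invalid boolean parameter value: " + param);
}
```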

									
										8

api/stream_manager.cc
									
												
				@@ -11,6 +11,7 @@

				#include "streaming/stream_result_future.hh"

				#include "api/api.hh"

				#include "api/api-doc/stream_manager.json.hh"

				#include "api/api-doc/storage_service.json.hh"

				#include <vector>

				#include <rapidjson/document.h>

				#include "gms/gossiper.hh"

				@@ -18,6 +19,7 @@

				namespace api {

				using namespace seastar::httpd;

				namespace ss = httpd::storage_service_json;

				namespace hs = httpd::stream_manager_json;

				static void set_summaries(const std::vector<streaming::stream_summary>& from,

				@@ -148,6 +150,11 @@ void set_stream_manager(http_context& ctx, routes& r, sharded<streaming::stream_

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    ss::get_stream_throughput_mb_per_sec.set(r, [&sm](std::unique_ptr<http::request> req) {

				        auto value = sm.local().throughput_mbs();

				        return make_ready_future<json::json_return_type>(value);

				    });

				}

				void unset_stream_manager(http_context& ctx, routes& r) {

				@@ -157,6 +164,7 @@ void unset_stream_manager(http_context& ctx, routes& r) {

				    hs::get_all_total_incoming_bytes.unset(r);

				    hs::get_total_outgoing_bytes.unset(r);

				    hs::get_all_total_outgoing_bytes.unset(r);

				    ss::get_stream_throughput_mb_per_sec.unset(r);

				}

				}

									
										41

api/system.cc
									
												
				@@ -10,9 +10,10 @@

				#include "api/api-doc/system.json.hh"

				#include "api/api-doc/metrics.json.hh"

				#include "replica/database.hh"

				#include "db/sstables-format-selector.hh"

				#include "sstables/sstables_manager.hh"

				#include <rapidjson/document.h>

				#include <boost/lexical_cast.hpp>

				#include <seastar/core/reactor.hh>

				#include <seastar/core/metrics_api.hh>

				#include <seastar/core/relabel_config.hh>

				@@ -53,7 +54,8 @@ void set_system(http_context& ctx, routes& r) {

				    hm::set_metrics_config.set(r, [](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        rapidjson::Document doc;

				        doc.Parse(req->content.c_str());

				        auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);

				        doc.Parse(content.c_str());

				        if (!doc.IsArray()) {

				            throw bad_param_exception("Expected a json array");

				        }

				@@ -86,21 +88,19 @@ void set_system(http_context& ctx, routes& r) {

				                relabels[i].expr = element["regex"].GetString();

				            }

				        }

				        return do_with(std::move(relabels), false, [](const std::vector<seastar::metrics::relabel_config>& relabels, bool& failed) {

				            return smp::invoke_on_all([&relabels, &failed] {

				                return metrics::set_relabel_configs(relabels).then([&failed](const metrics::metric_relabeling_result& result) {

				                    if (result.metrics_relabeled_due_to_collision > 0) {

				                        failed = true;

				                    }

				                    return;

				                });

				            }).then([&failed](){

				                if (failed) {

				                    throw bad_param_exception("conflicts found during relabeling");

				        bool failed = false;

				        co_await smp::invoke_on_all([&relabels, &failed] {

				            return metrics::set_relabel_configs(relabels).then([&failed](const metrics::metric_relabeling_result& result) {

				                if (result.metrics_relabeled_due_to_collision > 0) {

				                    failed = true;

				                }

				                return make_ready_future<json::json_return_type>(seastar::json::json_void());

				                return;

				            });

				        });

				        if (failed) {

				            throw bad_param_exception("conflicts found during relabeling");

				        }

				        co_return seastar::json::json_void();

				    });

				    hs::get_system_uptime.set(r, [](const_req req) {

				@@ -183,18 +183,13 @@ void set_system(http_context& ctx, routes& r) {

				        apilog.info("Profile dumped to {}", profile_dest);

				        return make_ready_future<json::json_return_type>(json::json_return_type(json::json_void()));

				    }) ;

				}

				void set_format_selector(http_context& ctx, routes& r, db::sstables_format_selector& sel) {

				    hs::get_highest_supported_sstable_version.set(r, [&sel] (std::unique_ptr<request> req) {

				        return smp::submit_to(0, [&sel] {

				            return make_ready_future<json::json_return_type>(seastar::to_sstring(sel.selected_format()));

				    hs::get_highest_supported_sstable_version.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return smp::submit_to(0, [&ctx] {

				            auto format = ctx.db.local().get_user_sstables_manager().get_highest_supported_format();

				            return make_ready_future<json::json_return_type>(seastar::to_sstring(format));

				        });

				    });

				}

				void unset_format_selector(http_context& ctx, routes& r) {

				    hs::get_highest_supported_sstable_version.unset(r);

				}

				}

									
										5

api/system.hh
									
												
				@@ -12,14 +12,9 @@ namespace seastar::httpd {

				class routes;

				}

				namespace db { class sstables_format_selector; }

				namespace api {

				struct http_context;

				void set_system(http_context& ctx, seastar::httpd::routes& r);

				void set_format_selector(http_context& ctx, seastar::httpd::routes& r, db::sstables_format_selector& sel);

				void unset_format_selector(http_context& ctx, seastar::httpd::routes& r);

				}

									
										98

api/task_manager.cc
									
												
				@@ -6,6 +6,7 @@

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include <seastar/core/chunked_fifo.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/coroutine/exception.hh>

				#include <seastar/http/exception.hh>

				@@ -14,6 +15,7 @@

				#include "api/api.hh"

				#include "api/api-doc/task_manager.json.hh"

				#include "db/system_keyspace.hh"

				#include "gms/gossiper.hh"

				#include "tasks/task_handler.hh"

				#include "utils/overloaded_functor.hh"

				@@ -25,20 +27,26 @@ namespace tm = httpd::task_manager_json;

				using namespace json;

				using namespace seastar::httpd;

				tm::task_status make_status(tasks::task_status status) {

				    auto start_time = db_clock::to_time_t(status.start_time);

				    auto end_time = db_clock::to_time_t(status.end_time);

				    ::tm st, et;

				    ::gmtime_r(&end_time, &et);

				    ::gmtime_r(&start_time, &st);

				static ::tm get_time(db_clock::time_point tp) {

				    auto time = db_clock::to_time_t(tp);

				    ::tm t;

				    ::gmtime_r(&time, &t);

				    return t;

				}

				    std::vector<tm::task_identity> tis{status.children.size()};

				    std::ranges::transform(status.children, tis.begin(), [] (const auto& child) {

				tm::task_status make_status(tasks::task_status status, sharded<gms::gossiper>& gossiper) {

				    chunked_fifo<tm::task_identity> tis;

				    tis.reserve(status.children.size());

				    for (const auto& child : status.children) {

				        tm::task_identity ident;

				        gms::inet_address addr{};

				        if (gossiper.local_is_initialized()) {

				            addr = gossiper.local().get_address_map().find(child.host_id).value_or(gms::inet_address{});

				        }

				        ident.task_id = child.task_id.to_sstring();

				        ident.node = fmt::format("{}", child.node);

				        return ident;

				    });

				        ident.node = fmt::format("{}", addr);

				        tis.push_back(std::move(ident));

				    }

				    tm::task_status res{};

				    res.id = status.task_id.to_sstring();

				@@ -47,8 +55,8 @@ tm::task_status make_status(tasks::task_status status) {

				    res.scope = status.scope;

				    res.state = status.state;

				    res.is_abortable = bool(status.is_abortable);

				    res.start_time = st;

				    res.end_time = et;

				    res.start_time = get_time(status.start_time);

				    res.end_time = get_time(status.end_time);

				    res.error = status.error;

				    res.parent_id = status.parent_id ? status.parent_id.to_sstring() : "none";

				    res.sequence_number = status.sequence_number;

				@@ -74,10 +82,13 @@ tm::task_stats make_stats(tasks::task_stats stats) {

				    res.keyspace = stats.keyspace;

				    res.table = stats.table;

				    res.entity = stats.entity;

				    res.shard = stats.shard;

				    res.start_time = get_time(stats.start_time);

				    res.end_time = get_time(stats.end_time);

				    return res;

				}

				void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>& tm, db::config& cfg) {

				void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>& tm, db::config& cfg, sharded<gms::gossiper>& gossiper) {

				    tm::get_modules.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        std::vector<std::string> v = tm.local().get_modules() | std::views::keys | std::ranges::to<std::vector>();

				        co_return v;

				@@ -96,11 +107,11 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				                throw bad_param_exception(fmt::format("{}", std::current_exception()));

				            }

				            if (auto it = req->query_parameters.find("keyspace"); it != req->query_parameters.end()) {

				                keyspace = it->second;

				            if (auto param = req->get_query_param("keyspace"); !param.empty()) {

				                keyspace = param;

				            }

				            if (auto it = req->query_parameters.find("table"); it != req->query_parameters.end()) {

				                table = it->second;

				            if (auto param = req->get_query_param("table"); !param.empty()) {

				                table = param;

				            }

				            return module->get_stats(internal, [keyspace = std::move(keyspace), table = std::move(table)] (std::string& ks, std::string& t) {

				@@ -108,7 +119,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				            });

				        });

				        std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {

				        noncopyable_function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {

				            auto s = std::move(os);

				            std::exception_ptr ex;

				            try {

				@@ -135,7 +146,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				        co_return std::move(f);

				    });

				    tm::get_task_status.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				    tm::get_task_status.set(r, [&tm, &gossiper] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};

				        tasks::task_status status;

				        try {

				@@ -144,7 +155,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				        } catch (tasks::task_manager::task_not_found& e) {

				            throw bad_param_exception(e.what());

				        }

				        co_return make_status(status);

				        co_return make_status(status, gossiper);

				    });

				    tm::abort_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				@@ -160,12 +171,12 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				        co_return json_void();

				    });

				    tm::wait_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				    tm::wait_task.set(r, [&tm, &gossiper] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};

				        tasks::task_status status;

				        std::optional<std::chrono::seconds> timeout = std::nullopt;

				        if (auto it = req->query_parameters.find("timeout"); it != req->query_parameters.end()) {

				            timeout = std::chrono::seconds(boost::lexical_cast<uint32_t>(it->second));

				        if (auto param = req->get_query_param("timeout"); !param.empty()) {

				            timeout = std::chrono::seconds(boost::lexical_cast<uint32_t>(param));

				        }

				        try {

				            auto task = tasks::task_handler{tm.local(), id};

				@@ -175,24 +186,24 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				        } catch (timed_out_error& e) {

				            throw httpd::base_exception{e.what(), http::reply::status_type::request_timeout};

				        }

				        co_return make_status(status);

				        co_return make_status(status, gossiper);

				    });

				    tm::get_task_status_recursively.set(r, [&_tm = tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				    tm::get_task_status_recursively.set(r, [&_tm = tm, &gossiper] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& tm = _tm;

				        auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};

				        try {

				            auto task = tasks::task_handler{tm.local(), id};

				            auto res = co_await task.get_status_recursively(true);

				            std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {

				            noncopyable_function<future<>(output_stream<char>&&)> f = [r = std::move(res), &gossiper] (output_stream<char>&& os) -> future<> {

				                auto s = std::move(os);

				                auto res = std::move(r);

				                co_await s.write("[");

				                std::string delim = "";

				                for (auto& status: res) {

				                    co_await s.write(std::exchange(delim, ", "));

				                    co_await formatter::write(s, make_status(status));

				                    co_await formatter::write(s, make_status(status, gossiper));

				                }

				                co_await s.write("]");

				                co_await s.close();

				@@ -206,7 +217,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				    tm::get_and_update_ttl.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        uint32_t ttl = cfg.task_ttl_seconds();

				        try {

				            co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->query_parameters["ttl"], utils::config_file::config_source::API);

				            co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->get_query_param("ttl"), utils::config_file::config_source::API);

				        } catch (...) {

				            throw bad_param_exception(fmt::format("{}", std::current_exception()));

				        }

				@@ -221,7 +232,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				    tm::get_and_update_user_ttl.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        uint32_t user_ttl = cfg.user_task_ttl_seconds();

				        try {

				            co_await cfg.user_task_ttl_seconds.set_value_on_all_shards(req->query_parameters["user_ttl"], utils::config_file::config_source::API);

				            co_await cfg.user_task_ttl_seconds.set_value_on_all_shards(req->get_query_param("user_ttl"), utils::config_file::config_source::API);

				        } catch (...) {

				            throw bad_param_exception(fmt::format("{}", std::current_exception()));

				        }

				@@ -232,6 +243,32 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				        uint32_t user_ttl = cfg.user_task_ttl_seconds();

				        co_return json::json_return_type(user_ttl);

				    });

				    tm::drain_tasks.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        co_await tm.invoke_on_all([&req] (tasks::task_manager& tm) -> future<> {

				            tasks::task_manager::module_ptr module;

				            try {

				                module = tm.find_module(req->get_path_param("module"));

				            } catch (...) {

				                throw bad_param_exception(fmt::format("{}", std::current_exception()));

				            }

				            const auto& local_tasks = module->get_local_tasks();

				            std::vector<tasks::task_id> ids;

				            ids.reserve(local_tasks.size());

				            std::transform(begin(local_tasks), end(local_tasks), std::back_inserter(ids), [] (const auto& task) {

				                return task.second->is_complete() ? task.first : tasks::task_id::create_null_id();

				            });

				            for (auto&& id : ids) {

				                if (id) {

				                    module->unregister_task(id);

				                }

				                co_await maybe_yield();

				            }

				        });

				        co_return json_void();

				    });

				}

				void unset_task_manager(http_context& ctx, routes& r) {

				@@ -243,6 +280,7 @@ void unset_task_manager(http_context& ctx, routes& r) {

				    tm::get_task_status_recursively.unset(r);

				    tm::get_and_update_ttl.unset(r);

				    tm::get_ttl.unset(r);

				    tm::drain_tasks.unset(r);

				}

				}
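The `get_time` helper introduced in this file factors out the time-point-to-`::tm` conversion that `make_status` previously did inline. A minimal standalone sketch of the same idea follows. It assumes `std::chrono::system_clock` as a stand-in for `db_clock`, and `to_utc_tm` is a hypothetical name.

```cpp
#include <chrono>
#include <ctime>

// Converts a time point to broken-down UTC time.
// std::chrono::system_clock is used here as a stand-in for db_clock.
std::tm to_utc_tm(std::chrono::system_clock::time_point tp) {
    std::time_t time = std::chrono::system_clock::to_time_t(tp);
    std::tm t{};
    ::gmtime_r(&time, &t); // POSIX; writes into caller-provided storage
    return t;
}
```

Returning the `std::tm` by value lets call sites such as `make_status` and `make_stats` assign start and end times in one expression each.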

									
										2

api/task_manager.hh
									
												
				@@ -18,7 +18,7 @@ namespace tasks {

				namespace api {

				void set_task_manager(http_context& ctx, httpd::routes& r, sharded<tasks::task_manager>& tm, db::config& cfg);

				void set_task_manager(http_context& ctx, httpd::routes& r, sharded<tasks::task_manager>& tm, db::config& cfg, sharded<gms::gossiper>& gossiper);

				void unset_task_manager(http_context& ctx, httpd::routes& r);

				}

									
										32

api/task_manager_test.cc
									
												
				@@ -6,6 +6,9 @@

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include "build_mode.hh"

				#ifndef SCYLLA_BUILD_MODE_RELEASE

				#include <seastar/core/coroutine.hh>

				@@ -54,20 +57,16 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man

				    tmt::register_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        sharded<tasks::task_manager>& tms = tm;

				        auto it = req->query_parameters.find("task_id");

				        auto id = it != req->query_parameters.end() ? tasks::task_id{utils::UUID{it->second}} : tasks::task_id::create_null_id();

				        it = req->query_parameters.find("shard");

				        unsigned shard = it != req->query_parameters.end() ? boost::lexical_cast<unsigned>(it->second) : 0;

				        it = req->query_parameters.find("keyspace");

				        std::string keyspace = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("table");

				        std::string table = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("entity");

				        std::string entity = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("parent_id");

				        const auto id_param = req->get_query_param("task_id");

				        auto id = !id_param.empty() ? tasks::task_id{utils::UUID{id_param}} : tasks::task_id::create_null_id();

				        const auto shard_param = req->get_query_param("shard");

				        unsigned shard = shard_param.empty() ? 0 : boost::lexical_cast<unsigned>(shard_param);

				        std::string keyspace = req->get_query_param("keyspace");

				        std::string table = req->get_query_param("table");

				        std::string entity = req->get_query_param("entity");

				        tasks::task_info data;

				        if (it != req->query_parameters.end()) {

				            data.id = tasks::task_id{utils::UUID{it->second}};

				        if (auto parent_id = req->get_query_param("parent_id"); !parent_id.empty()) {

				            data.id = tasks::task_id{utils::UUID{parent_id}};

				            auto parent_ptr = co_await tasks::task_manager::lookup_task_on_all_shards(tm, data.id);

				            data.shard = parent_ptr->get_status().shard;

				        }

				@@ -85,7 +84,7 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man

				    });

				    tmt::unregister_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->query_parameters["task_id"]}};

				        auto id = tasks::task_id{utils::UUID{req->get_query_param("task_id")}};

				        try {

				            co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_variant task_v, tasks::virtual_task_hint) -> future<> {

				                return std::visit(overloaded_functor{

				@@ -106,9 +105,8 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man

				    tmt::finish_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};

				        auto it = req->query_parameters.find("error");

				        bool fail = it != req->query_parameters.end();

				        std::string error = fail ? it->second : "";

				        std::string error = req->get_query_param("error");

				        bool fail = !error.empty();

				        try {

				            co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_variant task_v, tasks::virtual_task_hint) -> future<> {
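
Review note on the `finish_test_task` hunk above: switching from `query_parameters.find("error")` to `get_query_param("error")` appears to change how an empty `error=` value is treated. The old code set `fail` whenever the key was present, while the new code sets it only when the value is non-empty. The following is a minimal sketch of that difference, assuming `get_query_param` returns an empty string for a missing key (as the `!id_param.empty()` checks in this diff suggest). `request_sketch`, `fail_by_presence` and `fail_by_value` are hypothetical names for illustration, not Seastar or ScyllaDB APIs.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for an HTTP request; not Seastar's http::request.
struct request_sketch {
    std::unordered_map<std::string, std::string> query_parameters;

    // Assumed behaviour: a missing parameter yields an empty string.
    std::string get_query_param(const std::string& name) const {
        auto it = query_parameters.find(name);
        return it != query_parameters.end() ? it->second : std::string();
    }
};

// Old style from the hunk: failure is signalled by the key being present.
bool fail_by_presence(const request_sketch& req) {
    return req.query_parameters.find("error") != req.query_parameters.end();
}

// New style from the hunk: failure is signalled by a non-empty value.
bool fail_by_value(const request_sketch& req) {
    return !req.get_query_param("error").empty();
}
```

The two predicates agree when `error` is absent or carries text; they differ only for a request that passes `error` with an empty value.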

									
										156

api/tasks.cc
									
				@@ -12,6 +12,7 @@

				#include "api/api.hh"

				#include "api/storage_service.hh"

				#include "api/api-doc/tasks.json.hh"

				#include "api/api-doc/storage_service.json.hh"

				#include "compaction/compaction_manager.hh"

				#include "compaction/task_manager_module.hh"

				#include "service/storage_service.hh"

				@@ -25,96 +26,163 @@ extern logging::logger apilog;

				namespace api {

				namespace t = httpd::tasks_json;

				namespace ss = httpd::storage_service_json;

				using namespace json;

				using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<http::request>, sstring, std::vector<table_info>)>;

				static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {

				    return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				    };

				}

				static future<shared_ptr<compaction::major_keyspace_compaction_task_impl>> force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				    auto& db = ctx.db;

				    auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");

				    auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				    auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				    apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    std::optional<compaction::flush_mode> fmopt;

				    if (!flush && !consider_only_existing_data) {

				        fmopt = compaction::flush_mode::skip;

				    }

				    return compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);

				}

				static future<shared_ptr<compaction::upgrade_sstables_compaction_task_impl>> upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {

				    auto& db = ctx.db;

				    bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				    apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    return compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				}

				static future<shared_ptr<compaction::cleanup_keyspace_compaction_task_impl>> force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				    auto& db = ctx.db;

				    auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				    const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();

				    if (rs.is_local() || !rs.is_vnode_based()) {

				        auto reason = rs.is_local() ? "require" : "support";

				        apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);

				        co_return nullptr;

				    }

				    apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);

				    if (!co_await ss.local().is_vnodes_cleanup_allowed(keyspace)) {

				        auto msg = "Can not perform cleanup operation when topology changes";

				        apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				        co_await coroutine::return_exception(std::runtime_error(msg));

				    }

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    co_return co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(

				        {}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);

				}

				void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {

				    t::force_keyspace_compaction_async.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto params = req_params({

				            std::pair("keyspace", mandatory::yes),

				            std::pair("cf", mandatory::no),

				            std::pair("flush_memtables", mandatory::no),

				        });

				        params.process(*req);

				        auto keyspace = validate_keyspace(ctx, *params.get("keyspace"));

				        auto table_infos = parse_table_infos(keyspace, ctx, params.get("cf").value_or(""));

				        auto flush = params.get_as<bool>("flush_memtables").value_or(true);

				        apilog.debug("force_keyspace_compaction_async: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<flush_mode> fmopt;

				        if (!flush) {

				            fmopt = flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt);

				        auto task = co_await force_keyspace_compaction(ctx, std::move(req));

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    });

				    ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto task = co_await force_keyspace_compaction(ctx, std::move(req));

				        co_await task->done();

				        co_return json_void();

				    });

				    t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto keyspace = validate_keyspace(ctx, req);

				        auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);

				        if (!co_await ss.local().is_cleanup_allowed(keyspace)) {

				            auto msg = "Can not perform cleanup operation when topology changes";

				            apilog.warn("force_keyspace_cleanup_async: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				            co_await coroutine::return_exception(std::runtime_error(msg));

				        tasks::task_id id = tasks::task_id::create_null_id();

				        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));

				        if (task) {

				            id = task->get_status().id;

				        }

				        co_return json::json_return_type(id.to_sstring());

				    });

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));

				        if (task) {

				            co_await task->done();

				        }

				        co_return json::json_return_type(0);

				    });

				    t::perform_keyspace_offstrategy_compaction_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);

				        auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, nullptr);

				        auto task = co_await compaction_module.make_and_start_task<compaction::offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, nullptr);

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    }));

				    ss::perform_keyspace_offstrategy_compaction.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);

				        bool res = false;

				        auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<compaction::offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, &res);

				        co_await task->done();

				        co_return json::json_return_type(res);

				    }));

				    t::upgrade_sstables_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				        apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    }));

				    ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				        co_await task->done();

				        co_return json::json_return_type(0);

				    }));

				    t::scrub_async.set(r, [&ctx, &snap_ctl] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto info = co_await parse_scrub_options(ctx, snap_ctl, std::move(req));

				        auto info = parse_scrub_options(ctx, std::move(req));

				        if (!info.snapshot_tag.empty()) {

				            co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);

				        }

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<scrub_sstables_compaction_task_impl>({}, std::move(info.keyspace), db, std::move(info.column_families), info.opts, nullptr);

				        auto task = co_await compaction_module.make_and_start_task<compaction::scrub_sstables_compaction_task_impl>({}, std::move(info.keyspace), db, std::move(info.column_families), info.opts, nullptr);

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    });

				    ss::force_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				        auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				        apilog.info("force_compaction: flush={} consider_only_existing_data={}", flush, consider_only_existing_data);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<compaction::flush_mode> fmopt;

				        if (!flush && !consider_only_existing_data) {

				            fmopt = compaction::flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<compaction::global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);

				        co_await task->done();

				        co_return json_void();

				    });

				}

				void unset_tasks_compaction_module(http_context& ctx, httpd::routes& r) {

				    t::force_keyspace_compaction_async.unset(r);

				    ss::force_keyspace_compaction.unset(r);

				    t::force_keyspace_cleanup_async.unset(r);

				    ss::force_keyspace_cleanup.unset(r);

				    t::perform_keyspace_offstrategy_compaction_async.unset(r);

				    ss::perform_keyspace_offstrategy_compaction.unset(r);

				    t::upgrade_sstables_async.unset(r);

				    ss::upgrade_sstables.unset(r);

				    t::scrub_async.unset(r);

				    ss::force_compaction.unset(r);

				}

				}
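
Review note on the `api/tasks.cc` diff above: each operation now has one shared helper that starts the task, with an asynchronous endpoint that returns the task id and a synchronous endpoint that waits on `done()`. For cleanup, the helper may return a null pointer, which both endpoints tolerate. The following is a simplified sketch of that shape only. `task_sketch`, `start_cleanup`, `null_id_sketch` and the counter are hypothetical, there is no Seastar here, and unlike the real task (which starts running when created) the sketch runs its work inside `done()`.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for a task-manager task; not ScyllaDB's task type.
struct task_sketch {
    std::string id;
    std::function<void()> work;
    bool finished = false;

    // Stand-in for awaiting task->done(): runs the work once.
    void done() {
        if (!finished) {
            work();
            finished = true;
        }
    }
};

// Placeholder text for a null task id; the real string form is not shown in the diff.
const std::string null_id_sketch = "<null-id>";

// Shared helper: returns nullptr when the keyspace needs no cleanup.
std::shared_ptr<task_sketch> start_cleanup(bool cleanup_applicable, int& cleaned) {
    if (!cleanup_applicable) {
        return nullptr;
    }
    auto task = std::make_shared<task_sketch>();
    task->id = "task-1";
    task->work = [&cleaned] { ++cleaned; };
    return task;
}

// Asynchronous flavour: report the task id (or the null id) without waiting.
std::string cleanup_async(bool cleanup_applicable, int& cleaned) {
    auto task = start_cleanup(cleanup_applicable, cleaned);
    return task ? task->id : null_id_sketch;
}

// Synchronous flavour: wait for completion when a task was started, then return 0.
int cleanup_sync(bool cleanup_applicable, int& cleaned) {
    if (auto task = start_cleanup(cleanup_applicable, cleaned)) {
        task->done();
    }
    return 0;
}
```

Keeping validation and task creation in one helper means the two endpoints cannot drift apart in how they parse parameters; they differ only in whether they wait.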

Compare commits

4886 Commits dani-tweig ... copilot/ad

1 .gitattributes vendored
14 .github/CODEOWNERS vendored
97 .github/ISSUE_TEMPLATE/bug_report.yml vendored
86 .github/copilot-instructions.md vendored Normal file
115 .github/instructions/cpp.instructions.md vendored Normal file
51 .github/instructions/python.instructions.md vendored Normal file
115 .github/scripts/auto-backport.py vendored
81 .github/scripts/check-license.py vendored Executable file
20 .github/scripts/sync_labels.py vendored
16 .github/seastar-bad-include.json vendored Normal file
34 .github/workflows/add-label-when-promoted.yaml vendored
12 .github/workflows/call_jira_status_in_progress.yml vendored Normal file
12 .github/workflows/call_jira_status_in_review.yml vendored Normal file
12 .github/workflows/call_jira_status_ready_for_merge.yml vendored Normal file
52 .github/workflows/check-license-header.yaml vendored Normal file
2 .github/workflows/clang-nightly.yaml vendored
1 .github/workflows/clang-tidy.yaml vendored
143 .github/workflows/conflict_reminder.yaml vendored
34 .github/workflows/docs-validate-metrics.yml vendored Normal file
24 .github/workflows/iwyu.yaml vendored
29 .github/workflows/make-pr-ready-for-review.yaml vendored Normal file
4 .github/workflows/pr-require-backport-label.yaml vendored
7 .github/workflows/seastar.yaml vendored
4 .github/workflows/sync-labels.yaml vendored
21 .github/workflows/trigger-scylla-ci.yaml vendored Normal file
242 .github/workflows/trigger_ci.yaml vendored Normal file
50 .github/workflows/trigger_jenkins.yaml vendored Normal file
58 .github/workflows/urgent_issue_reminder.yml vendored Normal file
2 .gitignore vendored
3 .gitmodules vendored
141 CMakeLists.txt
2 CONTRIBUTING.md
69 HACKING.md
2 LICENSE-ScyllaDB-Source-Available.md
3 NOTICE.txt
4 README.md
2 SCYLLA-VERSION-GEN
4 alternator/CMakeLists.txt
1 alternator/auth.cc
11 alternator/consumed_capacity.cc
6 alternator/consumed_capacity.hh
8 alternator/controller.cc
6 alternator/controller.hh
6 alternator/error.hh
2862 alternator/executor.cc
83 alternator/executor.hh
24 alternator/expressions.cc
29 alternator/expressions.g
24 alternator/expressions.hh
4 alternator/expressions_types.hh
73 alternator/extract_from_attrs.hh Normal file
109 alternator/parsed_expression_cache.cc Normal file
21 alternator/rmw_operation.hh
70 alternator/serialization.cc
3 alternator/serialization.hh
486 alternator/server.cc
38 alternator/server.hh
197 alternator/stats.cc
75 alternator/stats.hh
69 alternator/streams.cc
154 alternator/ttl.cc
5 api/CMakeLists.txt
58 api/api-doc/compaction_manager.json
8 api/api-doc/gossiper.json
56 api/api-doc/service_levels.json Normal file
393 api/api-doc/storage_service.json
44 api/api-doc/task_manager.json
8 api/api-doc/tasks.json
92 api/api.cc
91 api/api.hh
30 api/api_init.hh
30 api/cache_service.cc
7 api/cache_service.hh
1 api/collectd.cc
662 api/column_family.cc
62 api/column_family.hh
71 api/compaction_manager.cc
4 api/compaction_manager.hh


34 api/config.cc