scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 00:13:31 +00:00

Author	SHA1	Message	Date
Avi Kivity	e2eeef3e01	Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused. No need to backport, code removal, Closes scylladb/scylladb#29002 * github.com:scylladb/scylladb: service level: make maybe_update_per_service_level_params synchronous service level: remove unused get_user_scheduling_group function service level: drop async find_effective_service_level service level: remove remnants of version 1 service level	2026-03-12 23:39:41 +02:00
Botond Dénes	eed3a6d407	sstables/mx/writer: move post-cell write yield to collection write loop Introduced by `54bddeb3b5`, the yield was added to write_cell(), to also help the general case where there is no collection. Arguably this was unnecessary and this patch moves the yield to write_collection(), to the cell write loop instead, so regular cells don't have to poll the preempt flag. Closes scylladb/scylladb#29013	2026-03-12 21:26:35 +02:00
Avi Kivity	e8a6706d6e	Merge 'shorten some sleeps to speed up bootstrap in tests' from Patryk Jędrzejczak This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests. The changed sleeps are: - the pause duration in group0 discovery, - the retry period in `wait_for_cql`. Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918 No backport: performance improvements mostly relevant to tests. Closes scylladb/scylladb#29020 * github.com:scylladb/scylladb: test: pylib: util: wait for CQL being ready with a shorter period group0: discovery: shorten the pause duration	2026-03-12 21:17:05 +02:00
Avi Kivity	76b6784c1a	Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz Use rolling_max_tracker to record gross bytes allocated during each CQL parse. The rolling maximum is then added to the memory estimate for incoming QUERY and PREPARE requests so that the admission control in the CQL transport layer accounts for parsing overhead. The measured memory footprint serves as an upper bound rather than an exact number, but its purpose is to prevent OOMs under heavy unprepared-statement load. In a benchmark, a node with 1G of memory shows a decrease in non-LSA memory usage from a peak of 320MB (our coordinator budget is 10% of 1G) to 96MB, while tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected, as memory admission kicks in to prevent OOM. This is phase 1 of OOM prevention, potential next steps: - add second admission in query_processor::get_statement trying to prevent potential thundering herd problem - decrease cql_server memory pool size - count reads in the memory pool - add per service level memory pool and a shared one Related https://scylladb.atlassian.net/browse/SCYLLADB-740 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938 Backport: no, new feature, but we may reconsider if some customer needs it Closes scylladb/scylladb#28919 * github.com:scylladb/scylladb: cql3: track CQL parsing memory cost and use it for admission control utils: add rolling max tracker	2026-03-12 19:59:52 +02:00
Alex	7fd39ba586	test/cluster: strengthen raft voters multi-DC test and tune debug runtime The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd. In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC. This change restores coverage of the original intent: - Replace num_nodes parametrization with explicit DC triples. - Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required. - Keep a larger topology case (6, 3, 3) for broader coverage. - Mark (6, 3, 3) as skip_mode(debug) with reason: larger topology case is too slow in debug on minipcs. Also updated comments/docstring to match the new setup. Fixes: SCYLLADB-794 backport: None, it is done to deflake minipcs that will start working only on master Closes scylladb/scylladb#29000	2026-03-12 17:07:45 +01:00
Marcin Maliszkiewicz	975cd60e05	ldap: fix use-after-move crash in ldap_reuser::reap() After stop() moved _reaper, in-flight with_connection() callbacks could still call reap(), which accessed the moved-from future causing a SIGSEGV in future_base::detach_promise(). Add a seastar::gate so stop() waits for all in-flight operations before moving _reaper. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043 Closes scylladb/scylladb#29015	2026-03-12 16:48:45 +02:00
Patryk Jędrzejczak	c50cf32793	test: pylib: util: wait for CQL being ready with a shorter period `wait_for_cql` is used in hundreds, if not thousands, of places in tests. We shouldn't waste up to 1s for every call. Also, the 1s period is clearly too long compared to the bootstrap time, which is usually 0-3s in dev mode. The following test speeds up from 50s to 42s with the change: ``` for _ in range(10): servers = await manager.servers_add(3) await manager.get_ready_cql(servers) ```	2026-03-12 15:40:19 +01:00
Patryk Jędrzejczak	f85628a9a0	group0: discovery: shorten the pause duration Nodes currently pause group0 discovery for 1s. This case is always hit while adding multiple nodes in parallel to an empty cluster by all nodes except the one that becomes the group0 leader. This is fine in production, but in tests, the slowdown is quite significant. Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the cluster is empty. Many cluster tests are affected. In this commit, we decrease the sleep duration from 1s to 100ms to speed up tests. The consequence of this change is that nodes might perform more steps in group0 discovery, but the increase in CPU usage and network traffic should be negligible.	2026-03-12 15:40:18 +01:00
Gleb Natapov	c67f876893	service level: make maybe_update_per_service_level_params synchronous It does not call async functions any more.	2026-03-12 15:53:08 +02:00
Gleb Natapov	c30907b8f2	service level: remove unused get_user_scheduling_group function	2026-03-12 14:28:26 +02:00
Gleb Natapov	a934d8391d	service level: drop async find_effective_service_level find_cached_effective_service_level does exactly the same thing now and is synchronous.	2026-03-12 14:28:26 +02:00
Botond Dénes	15cfa5beeb	mutation/collection_mutation: don't copy the serialized collection serialize_collection_mutation() copies the serialized collection into the returned collection_mutation object. Change to move to avoid the copy. Fixes: SCYLLADB-1041 Closes scylladb/scylladb#29010	2026-03-12 13:57:40 +02:00
Gleb Natapov	f888f2dced	service level: remove remnants of version 1 service level can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well.	2026-03-12 12:27:52 +02:00
Nadav Har'El	27f0510280	test/alternator: test_gzip_request_oversized now passes on AWS The Alternator test test_compressed_request.py::test_gzip_request_oversized checks that a very large request that compresses to a small size is still rejected. This test passed on Alternator, but used to fail on DynamoDB because DynamoDB didn't reject this case. This was a bug in DynamoDB (a "decompression bomb" vulnerability), and after I reported it, it was fixed. So now this test does pass on DynamoDB (after a small modification to allow for different error codes). So remove its scylla_only marker, and make the comment true to the current state. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28820	2026-03-12 10:41:56 +01:00
Marcin Maliszkiewicz	b277d9d9aa	cql3: track CQL parsing memory cost and use it for admission control Use rolling_max_tracker to record gross bytes allocated during each CQL parse. The rolling maximum is then added to the memory estimate for incoming QUERY and PREPARE requests so that the admission control in the CQL transport layer accounts for parsing overhead. The measured memory footprint serves as an upper bound rather than an exact number, but its purpose is to prevent OOMs under heavy unprepared-statement load. In a benchmark, a node with 1G of memory shows a decrease in non-LSA memory usage from a peak of 320MB (our coordinator budget is 10% of 1G) to 96MB, while tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected, as memory admission kicks in to prevent OOM.	2026-03-12 10:16:10 +01:00
Botond Dénes	0b19a6de85	tombstone_gc: tombstone_gc_state::for_tests(): remove unused param Closes scylladb/scylladb#28923	2026-03-12 10:01:42 +01:00
Marcin Maliszkiewicz	2d22eea2f9	Merge 'cql3: Replace SCYLLA_ASSERT and abort by throwing_assert' from Nadav Har'El In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert(). The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again. All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure. I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely. throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message. 
Fixes #13970 (Exorcise assertions from CQL code paths) Refs #7871 (Exorcise assertions from Scylla) Closes scylladb/scylladb#28847 * github.com:scylladb/scylladb: cql3: remove unnecessary assert() cql3: replace abort() by throwing_assert() cql3: Replace SCYLLA_ASSERT by throwing_assert	2026-03-12 09:09:24 +01:00
Szymon Malewski	3116db6c2d	test: fix `testJsonOrdering` The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30. Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`. Fixes #28467 Closes scylladb/scylladb#28924	2026-03-12 09:07:08 +01:00
Marcin Maliszkiewicz	5b2a07b408	utils: add rolling max tracker We will use it later to track parser memory usage via per query samples. Tests runtime in dev: 1.6s	2026-03-12 08:56:41 +01:00
Nadav Har'El	09a399ae3c	Merge 'Replace estimated_histogram with approx_exponential_histogram - alternator' from Amnon Heiman _"A journey of a thousand miles begins with a single step" Lao Tzu_ ScyllaDB uses estimated_histogram in many places. We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and memory-efficient and can be exported as Prometheus native histograms. Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so histograms with different configurations cannot be mixed. The end goal is to replace all uses of estimated_histogram in the codebase. That migration needs a few small API adjustments, so I am splitting the work into steps for easier review. This series is the first step. It introduces a base template for fixed-size estimated histograms, and switches the Alternator's estimated_histogram with the template. This change is self-contained and valuable on its own, while keeping the scope limited. Minor adjustments were made to the code and tests so that the tests would pass. Follow-up PRs will apply the same pattern to the rest of the code. New feature no need to backport Closes scylladb/scylladb#28987 * github.com:scylladb/scylladb: alternator: migrate to operation_size_kb histograms test/alternator/test_metrics.py: Update the bucket in the histogram search alternator: Use batch_histogram for batch size histograms estimated_histogram.hh: adds estimated_histogram_with_max	2026-03-12 00:06:16 +02:00
Amnon Heiman	1339a44163	alternator: migrate to operation_size_kb histograms Switch Alternator operation-size metrics from the legacy estimated histogram implementation to estimated_histogram_with_max<512> and export them through the native approx-exponential histogram path. Add a dedicated operation-size histogram type alias based on estimated_histogram_with_max<512>. Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/ UpdateItem/BatchGetItem/BatchWriteItem) with the new type. Remove the custom legacy histogram-to-metrics adapter and use to_metrics_histogram() for operation size metrics, aligning export behavior with other approx-exponential histograms. Update Alternator metrics tests to compute expected le bucket boundaries using approx-exponential bucket math (including deduplication of equal bounds), so assertions match the new exported histogram schema. Update bucket helper signatures to use (max, precision) parameters and keep +Inf handling unchanged. Replace byte-to-KB ceiling conversion with plain integer division (bytes / 1024): histogram export already reports each bucket by its upper bound (le), so rounding input values up before bucketing is unnecessary and would over-shift borderline samples into higher buckets.	2026-03-11 17:29:14 +02:00
David	79f9967eaa	docs: update theme 1.9 Motivation Upgrades Sphinx to 9.x, MyST Parser to 5.x, Python to 3.11+–3.14, Node.js to 22, and replaces Poetry with uv for dependency management. Changelog: https://github.com/scylladb/sphinx-scylladb-theme/blob/master/docs/source/upgrade/CHANGELOG.md#190---26-february-2026 How to test * Make sure you are using Python 3.11-3.14: * python --version * Install uv: * make setupenv * Build the docs: * make preview * Docs should render without errors at http://127.0.0.1:5500 Closes scylladb/scylladb#28971	2026-03-11 16:56:51 +02:00
Aleksandra Martyniuk	2e68f48068	nodetool: cluster repair: do not fail if a table was dropped nodetool cluster repair without additional params repairs all tablet keyspaces in a cluster. Currently, if a table is dropped while the command is running, all tables are repaired but the command finishes with a failure. Modify nodetool cluster repair. If a table wasn't specified (i.e. all tables are repaired), the command finishes successfully even if a table was dropped. If a table was specified and it does not exist (e.g. because it was dropped before the repair was requested), then the behavior remains unchanged. Fixes: SCYLLADB-568. Closes scylladb/scylladb#28739	2026-03-11 16:35:04 +02:00
Dani Tweig	45d7d9a96c	.github/workflow: also call call_sync_milestone_to_jira.yml for close milestone event What changed * Added closed to milestone event types in call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed]) * Added VECTOR to the list of Jira project keys being synced (jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR) Why (Requirements Summary) * The call_sync_milestone_to_jira.yml workflow only triggered on milestone creation. When a GitHub milestone is closed, the corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG projects) should be marked as released. Adding the closed trigger enables the called workflow (main_sync_milestone_to_jira_release.yml in github-automation) to handle both creating and releasing Jira versions from GitHub milestone events. * Added the VECTOR project so its Jira versions are also created/released when milestones are created or closed in scylladb.git. * This is consistent with the same change already applied to the staging and scylla-machine-image repos. Fixes:PM-216 Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync Closes scylladb/scylladb#28981	2026-03-11 15:56:55 +02:00
Amnon Heiman	69fbcd32bd	test/alternator/test_metrics.py: Update the bucket in the histogram search	2026-03-11 15:24:05 +02:00
Amnon Heiman	50af1f3671	alternator: Use batch_histogram for batch size histograms Switch batch-related histograms to estimated_histogram_with_max. This results in lower memory consumption and improved efficiency.	2026-03-11 15:21:25 +02:00
Amnon Heiman	b22162c719	estimated_histogram.hh: adds estimated_histogram_with_max This patch adds the estimated_histogram_with_max template, which will serve as a base for specific estimated_histograms, eventually replacing the current struct implementation. Introduce estimated_histogram_with_max<Max> as a reusable wrapper around approx_exponential_histogram<1, Max, 4>, providing merge support and the same add helpers used by the existing estimated_histogram type. Add estimated_histogram_with_max_merge(). Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-03-11 15:02:37 +02:00
Radosław Cybulski	fe8117feee	alternator: fix shard's parent calculation for vnodes Fix an invalid condition when searching for a parent shard when the table is based on vnodes. Each shard has an associated `last token`: the token that marks the end of the range of tokens it consumes (inclusive). Additional assumptions are that the whole token space is used and that (for vnodes) the token space wraps around. Previously the code looked like this: auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) { return t < id.token(); }); if (pid != pids.begin()) { pid = std::prev(pid); } An `upper_bound` call with `t < id.token()` looks for the first iterator for which `t < id.token()` becomes true, which effectively means the first position whose token is bigger than the searched value. Then we move the iterator backward once if possible. Assuming token space <-2, 2> and parents [0, 2], when we search for: - -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok) - 0 -> we will get 2, it's not first, so we go back and we return 0 (ok) - 1 -> we will get 2, it's not first, so we go back and we return 0 (not ok - should be 2) The fix is to replace it with `std::lower_bound` and remove the conditional backward motion. Since we have a guarantee that the whole token space is used, if `std::lower_bound` ends with the `end()` value, then we have a wrap-around case and we need to pick `begin()` as the result. Fixes #28354 Fixes: SCYLLADB-537 Closes scylladb/scylladb#28382	2026-03-11 14:51:42 +02:00
Piotr Dulikowski	d9a277453e	Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically. This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only derives the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops. Test coverage was extended in test/cluster/test_prepare_race.py: - reproduces the invalidation-during-prepare window with injection, - verifies prepare completes successfully, - then invalidates again and executes the same stale client prepared object, - confirms the driver transparently re-requests/re-prepares and execution succeeds. This change introduces: - no behavior change for normal prepare flow besides stronger lifetime guarantees, - no new protocol semantics, - preserves existing cache invalidation logic, - adds explicit cluster-level regression coverage for both the race and driver reprepare path. - pushes the re-prepare operation towards the driver: the server will return an unprepared error the first time and the driver will have to re-prepare during the execution stage Fixes: https://github.com/scylladb/scylladb/issues/27657 Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness. 
Closes scylladb/scylladb#28952 * github.com:scylladb/scylladb: transport/messages: hold pinned prepared entry in PREPARE result cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race	2026-03-11 12:09:23 +01:00
Patryk Jędrzejczak	37aeba9c8c	Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller. Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and wait for them to complete. This is a best-effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user. Auth and service levels write paths are switched to use this new mechanism. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650 Backport: no, new feature Closes scylladb/scylladb#28731 * https://github.com/scylladb/scylladb: test: add tests for global group0_batch barrier feature qos: switch service levels write paths to use global group0_batch barrier auth: switch write paths to use global group0_batch barrier raft: add function to broadcast read barrier request raft: add gossiper dependency to raft_group0_client raft: add read barrier RPC	2026-03-11 10:37:19 +01:00
Botond Dénes	54bddeb3b5	sstables/mx/writer: yield after writing a cell With the goal of avoiding stalls on writing large collections, like below: ++[0#1/1 100%] addr=0x5422d1e total=32 count=1 avg=32: \| seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85 ++ - addr=0x541b6d4: \| seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:811 \| (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:838 ++ - addr=0x541afb7: \| seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1479 ++ - addr=0x541b86c: \| seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1214 \| (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1234 \| (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1548 /opt/scylladb/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=f83d43b9b4b0ed5c2bd0a1613bf33e08ee054c93, for GNU/Linux 3.2.0, not stripped ++ - addr=/opt/scylladb/libreloc/libc.so.6+0x1a28f: \| sigpending at ??:0 ++ - addr=0x1760bf6: \| std::basic_string_view<signed char, std::char_traits<signed char> >::remove_prefix at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/string_view:302 \| (inlined by) managed_bytes_basic_view<(mutable_view)0>::remove_prefix at ././utils/managed_bytes.hh:421 \| (inlined by) _Z11read_simpleIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEET_RT0_ at ././utils/fragment_range.hh:365 \| (inlined by) 
_ZL9get_fieldIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEQsr3stdE12is_trivial_vIT_EES3_T0_j at ././mutation/atomic_cell.hh:62 \| (inlined by) atomic_cell_type::timestamp at ././mutation/atomic_cell.hh:103 \| (inlined by) basic_atomic_cell_view<(mutable_view)0>::timestamp at ././mutation/atomic_cell.hh:232 \| (inlined by) sstables::mc::writer::write_cell at ./sstables/mx/writer.cc:1101 \| (inlined by) sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0::operator() at ./sstables/mx/writer.cc:1233 \| (inlined by) collection_mutation_view::with_deserialized<sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0> at ././mutation/collection_mutation.hh:97 \| (inlined by) sstables::mc::writer::write_collection at ./sstables/mx/writer.cc:1221 ++ - addr=0x1677af3: \| sstables::mc::writer::write_cells at ./sstables/mx/writer.cc:1261 \| (inlined by) sstables::mc::writer::write_row_body at ./sstables/mx/writer.cc:1287 \| (inlined by) sstables::mc::writer::write_clustered at ./sstables/mx/writer.cc:1377 \| (inlined by) _ZN8sstables2mc6writer15write_clusteredI14clustering_rowQ9ClusteredIT_EEEvRKS4_9tombstone at ./sstables/mx/writer.cc:766 \| (inlined by) sstables::mc::writer::consume at ./sstables/mx/writer.cc:1425 Putting the yield in write_cell() instead of in write_collection() means that writing any row benefits from the added yield point in the middle. Refs: SCYLLADB-964 Closes scylladb/scylladb#28948	2026-03-11 10:34:55 +01:00
Botond Dénes	475220b9c9	Merge 'Remove the rest of pre raft topology code' from Gleb Natapov Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is still not upgraded to raft topology. Both of those are not supported any more. No need to backport since we remove functionality here. Closes scylladb/scylladb#28841 * github.com:scylladb/scylladb: service level: remove version 1 service level code features: move GROUP0_SCHEMA_VERSIONING to deprecated features list migration_manager: remove unused forward definitions test: remove unused code auth: drop auth_migration_listener since it does nothing now schema: drop schema_registry_entry::maybe_sync() function schema: drop make_table_deleting_mutations since it should not be needed with raft schema: remove calculate_schema_digest function schema: drop recalculate_schema_version function and its uses migration_manager: drop check for group0_schema_versioning feature cdc: drop usage of cdc_local table and v1 generation definition storage_service: no need to add yourself to the topology during reboot since raft state loading already did it storage_service: remove unused functions group0: drop with_raft() function from group0_guard since it always returns true now gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more gossiper: drop tokens from loaded_endpoint_state gossiper: remove unused functions storage_service: do not pass loaded_peer_features to join_topology() storage_service: remove unused fields from replacement_info gossiper: drop is_safe_for_restart() function and its use storage_service: remove unused variables from join_topology gossiper: remove the code that was only used in gossiper topology storage_service: drop the check for raft mode from recovery code cdc: remove legacy code test: remove unused injection points auth: remove legacy auth mode and upgrade code treewide: remove schema pull code since we never pull schema any more raft topology: drop upgrade_state and 
its type from the topology state machine since it is not used any longer group0: hoist the checks for an illegal upgrade into main.cc api: drop get_topology_upgrade_state and always report upgrade status as done service_level_controller: drop service level upgrade code test: drop run_with_raft_recovery parameter to cql_test_env group0: get rid of group0_upgrade_state storage_service: drop topology_change_kind as it is no longer needed storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more service_storage: remove unused functions storage_service: remove non raft rebuild code storage_service: set topology change kind only once group0: drop in_recovery function and its uses group0: rename use_raft to maintenance_mode and make it sync	2026-03-11 10:24:20 +02:00
Piotr Dulikowski	38a2829f69	Merge 'Return HTTP error description in Vector Store client' from Szymon Wasik The `service_error` struct: `6dc2c42f8b/service/vector_store_client.hh (L64)` currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: `6dc2c42f8b/service/vector_store_client.cc (L580)` For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs. The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well. Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error. Fixes: VECTOR-189 Closes scylladb/scylladb#26139 * github.com:scylladb/scylladb: vector_search: Avoid quadratic reallocation in response_content_to_sstring vector_store_client: Return HTTP error description, not just code	2026-03-11 09:19:27 +01:00
Calle Wilund	6d8ac23731	test_encryption: Use maximum replication in _smoke_test Refs: SCYLLADB-557 We should use full replication in KS/CF creation and population, for at least two reasons: 1.) Ensure we wait fully for and write to all nodes 2.) Make test more "real", behaving like a proper cluster Closes scylladb/scylladb#28959	2026-03-11 09:54:57 +02:00
Nadav Har'El	00a819bcd8	cql3: remove unnecessary assert() In cql3/, there was one call to assert() (not SCYLLA_ASSERT or throwing_assert), and it was: const auto shard_num = smp::count; assert(shard_num > 0) Rather than converting this assert() to throwing_assert() as I did in previous patches, I decided to outright remove it: Seastar guarantees that smp::count is not zero. Many other places in the code use smp::count assuming that it is correct, no other place bothers to assert it isn't zero. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-11 09:43:24 +02:00
Nadav Har'El	34eec020b3	cql3: replace abort() by throwing_assert() After the previous patch replaced all SCYLLA_ASSERT() calls by throwing_assert(), this patch also replaces all calls to abort(). All these abort() calls are supposedly cases that can never happen, but if they ever do happen because of a bug, in none of these places do we absolutely need to crash - an exception that aborts the current operation should be enough. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-11 09:43:11 +02:00
Nadav Har'El	c87d6407ed	cql3: Replace SCYLLA_ASSERT by throwing_assert In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/ directory by throwing_assert(). The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again. All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure. I reviewed all the changes that I did in this patch to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely. throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, whereas throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message. Fixes #13970 (Exorcise assertions from CQL code paths) Refs #7871 (Exorcise assertions from Scylla) Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-11 09:41:20 +02:00
Botond Dénes	99fa912f1b	Merge 'Generalize streaming scopes tests' from Pavel Emelyanov To restore how streaming scopes work there are two tests that greatly duplicate each other -- test_restore_with_streaming_scopes from cluster/object_store suite and test_refresh_with_streaming_scopes from cluster suite. This patch generalizes both into a do_test_streaming_scopes() non-test function Closes scylladb/scylladb#28874 * github.com:scylladb/scylladb: test: Re-sort comments around do_test_streaming_scopes() test: Split do_load_sstables() test: Drop load_fn argument from do_load_sstables() test: Re-use do_test_streaming_scopes() in refresh test test: Introduce SSTablesOnLocalStorage test: Introduce SSTablesOnObjectStorage test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()	2026-03-11 09:35:21 +02:00
Dmitriy Kruglov	cee44716db	docs: add cluster platform migration procedure Document how to migrate a ScyllaDB cluster to different instance types using the add-and-replace node cycling approach. Closes: QAINFRA-42 Closes scylladb/scylladb#28458	2026-03-11 09:31:35 +02:00
Nadav Har'El	401dc1894c	test/alternator,cqlpy: avoid xfail_strict against DynamoDB/Cassandra Recently, in commit `7b30a39`, we added to pytest.ini the option xfail_strict. This option causes every time a test XPASSes, i.e., an XFAIL test actually passes, to be considered an error and fail the test. While this has some benefits, it's a big problem when running tests against a reference implementation like DynamoDB or Cassandra: We typically mark a test "xfail" if the test shows a known bug - i.e., if the test fails on Scylla but passes on the reference system (DynamoDB or Cassandra). This means that when running "test/cqlpy/run-cassandra" or "test/alternator/run --aws", we expect to see many tests XPASS, and now this will cause these runs to "fail". So in this patch we add the xfail_strict=false to cqlpy/run-cassandra and alternator/run --aws. This option is not added to cqlpy/run or to alternator/run without --aws, and also doesn't affect test.py or Jenkins. P.S. This is another nail in the coffin of doing "cd test/alternator; pytest --aws". You should get used to running Alternator tests through test/alternator/run, even if you don't need to run Scylla (the "--aws" option doesn't run Scylla). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28973	2026-03-11 09:29:30 +02:00
Robert Bindar	29619e48d7	replica/table: calculate manifest tablet_count from tablet map During tests I noticed that if the number of tablets is very small, say 2, and the number of nodes is 3 (2 shards per node), using the number of storage groups on each shard, a shard may end up holding 0 groups, whilst the other holds 1 group. And in some nodes even both shards have 0 groups. Taking the minimum among shards here was showing in manifests a tablet count of 0 for all 3 nodes, which is incorrect. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#28978	2026-03-11 09:27:04 +02:00
Botond Dénes	3fed6f9eff	Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk Currently, for repair tasks tablet_virtual_task::wait gathers the ids of tablets that are to be repaired. The gathered set is later used to check if the repair is still ongoing. However, if the tablets are resized (split or merged), the gathered set becomes irrelevant. Thus, we may end up with an invalid tablet id error being thrown. Wait until repair is done for all tablets in the table. Fixes: https://github.com/scylladb/scylladb/issues/28202 Backport to 2026.1 needed as it contains the change introducing the issue `d51b1fea94` Closes scylladb/scylladb#28323 * github.com:scylladb/scylladb: service: fix indentation test: add test_tablet_repair_wait service: remove status_helper::tablets service: tasks: scan all tablets in tablet_virtual_task::wait	2026-03-11 09:24:07 +02:00
Raphael S. Carvalho	cc5b1acadf	Improve log when sstable load fails due to missing tablet replica A bug or some bad operator intervention can lead to an sstable existing in a node after the tablet replica was moved to a different node. This will result in sstable loading failing during boot, requiring operator intervention. The log today just dumps the name of the "orphaned" sstable, but one investigating it might want to know which process (repair, memtable, whatever) generated that sstable, if the sstable was created locally or remotely, and the current replica set of the underlying tablet. From the original identifier, we can know the exact time the sstable was created on its original node. From the current id, we know the time it was created on the current node. All this info can help the investigator to correlate with events in other nodes (includes actions from the coordinator) to get closer to the root cause. The new log will look like this: "Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db (originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330 on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d) of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])" Refs https://scylladb.atlassian.net/browse/SCYLLADB-788. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28921	2026-03-11 06:20:34 +02:00
Avi Kivity	b17e1259e3	Merge 'types: optimize vector deserialization for high-dimensional vectors' from Szymon Wasik Vector deserialization is an operation whose performance is critical for the vector similarity search feature because it is frequently executed during the rescoring operation. Some of the identified performance bottlenecks for it include: 1. Per-element virtual dispatch in deserialize(): each of the N elements went through visit() which switches on ~28 type variants. For a 1024-dimension float vector, that's 1024 redundant type switches when the element type is the same for all of them. 2. Redundant work in split_fragmented(): value_length_if_fixed() was called inside the loop (N virtual calls), and no reserve() was done on the output vector causing repeated reallocations. This series fixes both: - Introduce deserialize_vector_visitor that dispatches on the element type once for the entire vector, then loops inside the resolved handler. Simple numeric types (float, int, etc.) call deserialize_value() directly with no virtual dispatch per element. String types (ascii, utf8) get a dedicated handler that skips make_empty() (sstring has no empty_t constructor). Complex types (list, map, tuple, etc.) fall back to per-element dispatch. - In split_fragmented(), reserve the output vector to _dimension and cache value_length_if_fixed() before the loop. Benchmark results (1024-dim float vector, release build, -O3 -flto): deserialize: 15.73 us -> 11.70 us (1.34x, 26% faster) split_fragmented: 10.34 us -> 7.45 us (1.39x, 28% faster) References: SCYLLADB-471 Backport: none, unless we observe some critical performance improvement for quantization. Closes scylladb/scylladb#28618 * github.com:scylladb/scylladb: types: optimize reading vector fragments types: optimize vector deserialization for high-dimensional vectors	2026-03-11 00:39:46 +02:00
Dawid Mędrek	167feabe1a	cql3: Reject user-provided timestamps for strongly consistent tables Similarly to LWTs, we reject queries with user-provided timestamps when they target strongly consistent tables. Such statements could force us to rewrite history, and that contradicts the philosophy of linearizability we aim for. Fixes SCYLLADB-879 Closes scylladb/scylladb#28867	2026-03-10 22:11:39 +02:00
Marcin Maliszkiewicz	8ae80a32c0	Update seastar submodule * seastar d2953d2a...4d268e0e (32): > Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs doc: update prometheus.md with __name__ filter enhancements prometheus: support prefixed names in __name__ filter prometheus: add benchmarks for name filter performance prometheus: support multiple __name__ query parameters prometheus: move write_body_args to header > fair_queue: Subtract from _queued_capacity on pop_front() > memory: expose cumulative allocated bytes statistic > Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov test: IO bandiwdth throttler unit tests code: Add ability to configure IO bandwidth limit for supergroup io_queue: Have more than one throttler par class io_queue: Introduce bandwidth_throttler helper class io_queue: Nest io_group::priotiy_class_data-s io_queue: Update class bandwidth on group's class data io_queue: Make io_group::priority_class_data::tokens() static fair_queue: Introduce group (un)plugging > Fix _shard_to_numa_node_mapping double population > Use exception parameter in log_timer_callback_exception() > Fix wakeup_granularity() fallback debug-fs reading > test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated > build: support tuning -ffile-prefix-map > test: Remove unused C::dup() method of testing class > src/core/reactor: introduce reactor::get_backend_name() > util/process: add pid() accessor > Merge 'Add source location to task and tasktrace object' from Radosław Cybulski coroutine.hh: disable source_location for GCC to avoid ICE reactor: improve do_dump_task_queue reporting Use source_location in `do_dump_task_queue` Update backtrace with source locations of resume points Add calls to update resume_point Add a std::source_location (resume_point) to task object. 
> Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov file: Templatize posix_file_handle_impl file: Don't dup() non-read-only files file: Split ..._impl::dup() implementations test: Add a simple test for dup() > Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov reactor: Deprecate make_pollable_fd() net/posix: Create file_desc for sockets in-place reactor,net: Keep sock_need_nonblock boolean on posix_network_stack net/posix: Re-format constructor initializer lists > Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs test: add fuzz tests to CI workflow test: add sstring differential fuzzer test: add fuzz testing infrastructure > Introduce "integrated queue length" metrics and use it for IO classes (#3210) > reactor: Remove get_sg_data(unsigned) overload > memcached: Stop using scattered_message > reactor: Mark uptime() method const > alien: Remove deprecated run_on and submit_to calls > file: make open_flags and access_flags constexpr > scheduling: Unfriend some methods from scheduling_group > reactor: Move _dying bit to epoll backend > file: coroutinize the with_file templates > configure: validate --cook ingredient names > fix trailing whitespace > Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs perf_tests: document overhead column and threshold options perf_tests: add measurement overhead tracking and warnings perf_tests: remove inline/hot attributes from time_measurement methods perf_tests: move time_measurement class to implementation file perf_tests: move perf counters into time_measurement singleton > rpc: log handler type > Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs Add GitHub Actions workflow for pre-commit enforcement Add pre-commit setup documentation to HACKING.md Add pre-commit configuration with trailing-whitespace hook Remove trailing whitespace from source files > posix-stack: Make internal::posix_connect() resolve exceptions into futures > sstring: fix npos to be size_t for consistency with std::string Closes scylladb/scylladb#28954	2026-03-10 22:06:58 +02:00
Szymon Wasik	7fae78d2b0	types: optimize reading vector fragments There was redundant work in split_fragmented(): value_length_if_fixed() was called inside the loop (N virtual calls), and no reserve() was done on the output vector, causing repeated reallocations. This patch reserves the output vector to _dimension and caches value_length_if_fixed() before the loop. Additionally, split read_vector_element() into two specialized functions: read_vector_element_fixed() and read_vector_element_variable(), and hoist the branch on fixed_len outside the loop in split_fragmented() and deserialize_loop(). This avoids a conditional branch per element in the hot path. Benchmark results (1024-dim float vector, release build, -O3 -flto): 10.34 us -> 7.45 us (1.39x, 28% faster)	2026-03-10 20:17:31 +01:00
Szymon Wasik	6c0ef8eb92	types: optimize vector deserialization for high-dimensional vectors One of the performance bottlenecks while deserializing vectors was per-element virtual dispatch in deserialize(): each of the N elements went through visit() which switches on ~28 type variants. For a 1024-dimension float vector, that's 1024 redundant type switches when the element type is the same for all of them. This patch introduces deserialize_vector_visitor that dispatches on the element type once for the entire vector, then loops inside the resolved handler. Simple numeric types (float, int, etc.) call deserialize_value() directly with no virtual dispatch per element. String types (ascii, utf8) get a dedicated handler that skips make_empty() (sstring has no empty_t constructor). Complex types (list, map, tuple, etc.) fall back to per-element dispatch. Benchmark results (1024-dim float vector, release build, -O3 -flto): 15.73 us -> 11.70 us (1.34x, 26% faster)	2026-03-10 18:21:34 +01:00
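The "dispatch once, then loop" idea from the commit above can be illustrated with a small standalone sketch. Everything here is hypothetical: `elem_kind`, `value`, `read_raw`, `sketch_loop` and both `deserialize_*` functions are invented for illustration and do not mirror Scylla's `deserialize_vector_visitor`; only two element kinds are modelled, and elements are read in host byte order for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical element kinds (the real type system has many more variants).
enum class elem_kind { int32, float32 };
using value = std::variant<int32_t, float>;

// Read one fixed-size element from raw bytes (host byte order, for brevity).
template <typename T>
T read_raw(const unsigned char* p) {
    T v;
    std::memcpy(&v, p, sizeof(T));
    return v;
}

// Baseline: the type switch is evaluated once per element.
inline std::vector<value> deserialize_per_element(elem_kind k, const unsigned char* buf, std::size_t n) {
    std::vector<value> out;
    for (std::size_t i = 0; i < n; ++i) {
        switch (k) {
        case elem_kind::int32:
            out.emplace_back(std::in_place_type<int32_t>, read_raw<int32_t>(buf + i * sizeof(int32_t)));
            break;
        case elem_kind::float32:
            out.emplace_back(std::in_place_type<float>, read_raw<float>(buf + i * sizeof(float)));
            break;
        }
    }
    return out;
}

// Typed inner loop: no type switch inside, output reserved up front.
template <typename T>
std::vector<value> sketch_loop(const unsigned char* buf, std::size_t n) {
    std::vector<value> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.emplace_back(std::in_place_type<T>, read_raw<T>(buf + i * sizeof(T)));
    }
    return out;
}

// Optimized shape: switch on the element type once for the whole vector.
inline std::vector<value> deserialize_once(elem_kind k, const unsigned char* buf, std::size_t n) {
    switch (k) {
    case elem_kind::int32:   return sketch_loop<int32_t>(buf, n);
    case elem_kind::float32: return sketch_loop<float>(buf, n);
    }
    return {};
}
```

Both functions produce the same result; the second simply moves the type decision out of the hot loop, which is the shape of the change the benchmark numbers in the commit refer to.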
Andrzej Jackowski	9247dff8c2	reader_concurrency_semaphore: fix leak workaround `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection against resources that are "made up" of thin air to `reader_concurrency_semaphore`. If there are more `_resources` than the `_initial_resources`, it means there is a negative leak, and `on_internal_error_noexcept` is called. In addition to it, `_resources` is set to `std::max(_resources, _initial_resources)`. However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` states the opposite: "The detection also clamps the _resources to _initial_resources, to prevent any damage". Before this commit, the protection mechanism doesn't clamp `_resources` to `_initial_resources` but instead keeps `_resources` high, possibly even indefinitely growing. This commit changes `std::max` to `std::min` to make the code behave as intended. Refs: SCYLLADB-163 Closes scylladb/scylladb#28982	2026-03-10 18:57:31 +02:00
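The effect of swapping `std::max` for `std::min` can be seen in a minimal sketch. The `sketch_semaphore` struct and its `signal()` method are hypothetical and greatly simplified; they are not the real `reader_concurrency_semaphore`, and a plain `int` stands in for its resource type.

```cpp
#include <algorithm>

// Hypothetical, simplified model of a semaphore that guards against
// "negative leaks", i.e. more units being returned than were ever handed out.
struct sketch_semaphore {
    int initial;    // units the semaphore was created with
    int available;  // units currently available

    // Returns units to the semaphore. Reports true if a negative leak was
    // detected, in which case `available` is clamped back to `initial`.
    bool signal(int units) {
        available += units;
        if (available > initial) {
            // std::max(available, initial) would be a no-op in this branch,
            // since available is already the larger value; std::min is what
            // actually clamps.
            available = std::min(available, initial);
            return true;
        }
        return false;
    }
};
```

Inside the `available > initial` branch, `std::max` always returns `available` unchanged, which is why the old code never clamped anything.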
Szymon Wasik	74d86d3fe9	vector_search: Avoid quadratic reallocation in response_content_to_sstring Pre-compute the total size and allocate a single uninitialized sstring before copying the buffers, following the pattern from Seastar's read_entire_stream_contiguous(). This avoids iterative reallocation which is O(n^2) for large responses.	2026-03-10 17:45:55 +01:00
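The pre-compute-then-copy pattern described above might look roughly like the sketch below. It is an illustration under assumptions: `concat_buffers` is a hypothetical name, and `std::string` with `std::string_view` buffers stands in for `sstring` and Seastar's temporary buffers. Unlike the uninitialized sstring the commit mentions, `std::string::resize()` zero-fills the storage before the copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Concatenate a list of buffers with exactly one allocation: first sum the
// sizes, then allocate once, then copy each buffer into place.
inline std::string concat_buffers(const std::vector<std::string_view>& bufs) {
    std::size_t total = 0;
    for (std::string_view b : bufs) {
        total += b.size();
    }

    std::string out;
    out.resize(total);  // single allocation of the final size

    char* dst = out.data();
    for (std::string_view b : bufs) {
        dst = std::copy(b.begin(), b.end(), dst);  // advances past the copied bytes
    }
    return out;
}
```

Allocating the final size up front means each input byte is copied once, rather than being moved again whenever an append forces the destination to be reallocated.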

52527 Commits