scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Author	SHA1	Message	Date
copilot-swe-agent[bot]	77ee7f3417	Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy" This reverts commit `8192f45e84`. The merge exposed a bug where truncate (via drop) fails and causes Raft errors, leading to schema inconsistencies across nodes. This results in test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist' errors. The specific problematic change was in commit `19b6207f` which modified truncate_table_on_all_shards to set use_sstable_identifier = true. This causes exceptions during truncate that are not properly handled, leading to Raft applier fiber stopping and nodes losing schema synchronization.	2025-12-12 03:55:13 +00:00
copilot-swe-agent[bot]	0ff89a58be	Initial plan	2025-12-12 03:48:12 +00:00
Yaron Kaikov	f7ffa395a8	workflows: trigger CI automatically when conflicts label is removed Add pull_request_target event with unlabeled type to trigger-scylla-ci workflow. This allows automatic CI triggering when the 'conflicts' label is removed from a PR, in addition to the existing manual trigger via comment. The workflow now runs when: - A user posts a comment with '@scylladbbot trigger-ci' (existing) - The 'conflicts' label is removed from a PR (new) Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84 Closes scylladb/scylladb#27521	2025-12-11 16:48:06 +02:00
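The two trigger conditions listed in this commit can be restated as a small predicate. This is only a Python sketch of the decision, not the workflow itself (which is a GitHub Actions file); the function name `should_trigger_ci` is hypothetical, and the payload field names are an assumption about the shape of the event data.

```python
def should_trigger_ci(event_name: str, payload: dict) -> bool:
    """Return True when either trigger condition from the commit message holds.

    `payload` is assumed to look like a GitHub event payload; only the
    fields read below are needed.
    """
    if event_name == "issue_comment":
        # Existing trigger: a comment containing the bot command.
        body = payload.get("comment", {}).get("body", "")
        return "@scylladbbot trigger-ci" in body
    if event_name == "pull_request_target":
        # New trigger: the 'conflicts' label was removed from the PR.
        return (payload.get("action") == "unlabeled"
                and payload.get("label", {}).get("name") == "conflicts")
    return False
```

Removing any other label, or adding the 'conflicts' label, does not match either branch and so would not start CI in this sketch.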
Piotr Smaron	3fa3b920de	Update CODEOWNERS to remove redundant entries Removing myself as I have no maintainer's permissions to review the code Closes scylladb/scylladb#27576	2025-12-11 16:47:08 +02:00
Botond Dénes	e7ca52ee79	Merge 'api: storage_service/tablets/repair: disable incremental repair by default' from Benny Halevy Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Backport to 2025.4 where `611918056a` was introduced Closes scylladb/scylladb#27530 * github.com:scylladb/scylladb: api: storage_service/tablets/repair: disable incremental repair by default docs: nodetool-commands: cluster: repair: fix incremental-mode example	2025-12-11 15:23:09 +02:00
Botond Dénes	730eca5dac	Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from null Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation). Changes made: 1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation 2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group) 3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups()) 4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures 5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers Rationale: As noted by reviewers, there's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating. Fixes #27475 Closes scylladb/scylladb#27476 * github.com:scylladb/scylladb: replica: Remove unnecessary noexcept replica: Remove noexcept from compaction_groups() functions replica: Remove noexcept from storage_group::for_each_compaction_group	2025-12-11 15:17:35 +02:00
Benny Halevy	c8cff94a5a	api: storage_service/tablets/repair: disable incremental repair by default Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-11 14:25:21 +02:00
Benny Halevy	5fae4cdf80	docs: nodetool-commands: cluster: repair: fix incremental-mode example There is no 'regular' incremental mode anymore. The example seems to have meant 'disabled'. Fixes #27587 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-11 14:25:11 +02:00
Marcin Maliszkiewicz	8bbcaacba1	auth: always catch by const reference This is best practice. Closes scylladb/scylladb#27525	2025-12-11 12:42:30 +01:00
Yaron Kaikov	3dfa5ebd7f	Add JIRA issue validation to backport PR fixes check Extend the Fixes validation pattern to also accept JIRA issue references (format: [A-Z]+-\d+) in addition to GitHub issue references. This allows backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'. Fixes: https://github.com/scylladb/scylladb/issues/27571 Closes scylladb/scylladb#27572	2025-12-11 12:23:16 +02:00
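A minimal sketch of the kind of check this commit describes, assuming the validation is a regular expression applied to `Fixes` lines in the PR body. Only the JIRA key pattern `[A-Z]+-\d+` comes from the commit message; the helper name `has_valid_fixes`, the `#123` GitHub form, and the line-matching regex are illustrative assumptions, and the real check may accept more forms (for example full URLs).

```python
import re

# JIRA key format quoted in the commit message, e.g. PROJECT-123.
JIRA_KEY = re.compile(r"[A-Z]+-\d+")
# Assumed GitHub short reference form, e.g. #27587.
GITHUB_REF = re.compile(r"#\d+")
# Assumed shape of a Fixes line: 'Fixes: <ref>' or 'Fixes <ref>'.
FIXES_LINE = re.compile(r"^Fixes:?\s+(\S+)\s*$", re.MULTILINE)


def has_valid_fixes(body: str) -> bool:
    """Return True if any Fixes line references a GitHub issue or a JIRA key."""
    for match in FIXES_LINE.finditer(body):
        ref = match.group(1)
        if JIRA_KEY.fullmatch(ref) or GITHUB_REF.fullmatch(ref):
            return True
    return False
```

Using `fullmatch` on the reference keeps lowercase or partial keys such as `project-123` from passing.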
Avi Kivity	24264e24bb	Revert "repair: Add tablet repair progress report support" This reverts commit `faad0167d7`. It causes a regression in test_two_tablets_concurrent_repair_and_migration_repair_writer_level in debug mode (with ~5%-10% probability). Fixes #27510. Closes scylladb/scylladb#27560	2025-12-11 12:18:11 +02:00
Nadav Har'El	0c64e3be9a	Merge 'Unify and fix rjson string and string_view conversions' from Marcin Maliszkiewicz This patch-set consolidates and corrects rjson string conversion handling. It removes unnecessary string copies, ensures proper length usage and replaces ad-hoc conversions with consistent helper functions. Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase. Backport: no, it's a refactor Closes scylladb/scylladb#27394 * github.com:scylladb/scylladb: fix rjson::value to bytes conversion with missing GetStringLength call alternator: change type from string to string_view in should_add_capacity fix rjson::value to string_view conversion with missing GetStringLength call use rjson::to_string_view when rjson::value gets converted using GetStringLength use rjson::to_sstring and rjson::to_string for various string conversions utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds utils: move rjson::to_string_view func to string related place utils: add to_sstring and to_string rjson helper	2025-12-11 12:05:41 +02:00
Marcin Maliszkiewicz	d5b63df46e	transport: remove redundant futurize_invoke from counted data sink and source Closes scylladb/scylladb#27526	2025-12-11 10:32:16 +03:00
Dario Mirovic	f545ed37bc	test: dtest: audit_test.py: fix audit error log detection `test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py` has an insert statement that is expected to fail. Dtest environment uses `FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails means we expect 6 error audit logs. The test failed because `create keyspace ks` failed once, then succeeded on retry. It allowed the test to proceed properly, but the last part of the test that expects exactly 6 failed queries actually had 7. The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed queries, counting only the query expected to fail. If other queries fail with successful retry, it's fine. If other queries fail without successful retry, the test will fail, as it should in such situations. They are not related to this expected failed insert statement. Fixes #27322 Closes scylladb/scylladb#27378	2025-12-11 10:17:07 +03:00
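The counting rule in this commit (exactly 1 + `max_retries` failures, counting only the statement expected to fail) can be sketched as follows. The audit entry layout and the statement strings are hypothetical; only `max_retries = 5` and the expected total of 6 come from the commit message.

```python
MAX_RETRIES = 5  # FlakyRetryPolicy.max_retries, per the commit message


def count_expected_failures(audit_entries, expected_statement):
    """Count error entries for the one statement that is expected to fail."""
    return sum(
        1 for entry in audit_entries
        if entry["error"] and entry["statement"] == expected_statement
    )


# Hypothetical audit log: an unrelated statement fails once, then succeeds on
# retry, followed by the insert failing on the first try and on every retry.
failing_insert = "INSERT INTO ks.t (pk) VALUES (1)"
entries = [
    {"statement": "create keyspace ks", "error": True},
    {"statement": "create keyspace ks", "error": False},
] + [{"statement": failing_insert, "error": True}] * (1 + MAX_RETRIES)

total_errors = sum(1 for entry in entries if entry["error"])
expected_errors = count_expected_failures(entries, failing_insert)
```

Here `total_errors` is 7, which is what tripped the old assertion, while `expected_errors` is 6, which is what the patched test checks.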
Benny Halevy	5f13880a91	utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout When waiting for the condition variable times out we call on_internal_error, but unfortunately, the backtrace it generates is obfuscated by `coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`. To make the log more useful, print the error injection name and the caller's source_location in the timeout error message. Fixes #27531 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27532	2025-12-10 23:25:54 +01:00
Tomasz Grabiec	0e51a1f812	replica: Remove unnecessary noexcept Can potentially lead to unnecessary abort. compaction_groups() and for_each_compaction_group() can throw. Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:51:35 +01:00
Tomasz Grabiec	8b807b299e	replica: Remove noexcept from compaction_groups() functions They can throw during merge, when the number of compaction groups is higher than 3. Callers can deal with that, so we shouldn't abort.	2025-12-10 14:48:23 +01:00
Tomasz Grabiec	07ff659849	replica: Remove noexcept from storage_group::for_each_compaction_group They don't really have to be noexcept. And "action" may actually throw, leading to abort. It was observed to throw when creating memtable readers: terminate called after throwing an instance of 'utils::memory_limit_reached' what(): kill limit triggered on semaphore sl:users by permit xxx Aborting on shard 4, in scheduling group sl:users. std::terminate() at ??:0 __clang_call_terminate at main.cc:0 replica::storage_group::for_each_compaction_group(std::function<void (seastar::lw_shared_ptr<replica::compaction_group> const&)>) const at ./replica/table.cc:920 (inlined by) replica::table::add_memtables_to_reader_list(std::vector<mutation_reader, std::allocator<mutation_reader>>&, seastar::lw_shared_ptr<schema const> const&, reader_permit const&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr const&, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>, std::function<void (unsigned long)>) const at ./replica/table.cc:196 (inlined by) replica::table::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:243 (inlined by) replica::table::as_mutation_source() const::$_0::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:3673 (inlined by) mutation_reader std::__invoke_impl<mutation_reader, replica::table::as_mutation_source() const::$_0&, 
seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(std::__invoke_other, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>, mutation_reader>::type std::__invoke_r<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:114 (inlined by) std::_Function_handler<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, 
interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>), replica::table::as_mutation_source() const::$_0>::_M_invoke(std::_Any_data const&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 (inlined by) std::function<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>)>::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 (inlined by) mutation_source::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ././readers/mutation_source.hh:143 query::querier_base::querier_base(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, mutation_source const&, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:91 (inlined by) 
query::querier::querier(mutation_source const&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:164 (inlined by) replica::table::query(seastar::lw_shared_ptr<schema const>, reader_permit, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, query::result_memory_limiter&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::optional<query::querier>) at ./replica/table.cc:3583 replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0::operator()(reader_permit) const at ./replica/database.cc:1533 (inlined by) seastar::noncopyable_function<seastar::future<void> (reader_permit)>::indirect_vtable_for<replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0>::call(seastar::noncopyable_function<seastar::future<void> (reader_permit)> const, reader_permit) (.llvm.13537529942037499926) at ././seastar/include/seastar/util/noncopyable_function.hh:158 
seastar::noncopyable_function<seastar::future<void> (reader_permit)>::operator()(reader_permit) const at ././seastar/include/seastar/util/noncopyable_function.hh:215 (inlined by) reader_concurrency_semaphore::execution_loop() (.resume) at ./reader_concurrency_semaphore.cc:980 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/coroutine.hh:122 (inlined by) seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2627 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3099 seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3267 seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4591 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111 (inlined by) std::_Function_handler<void (), 
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 Fixes #27475 Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:48:11 +01:00
Yaron Kaikov	d3e199984e	auto-backport.py: modify instruction for making PR ready for review Update the comment sent when a PR has conflicts with clear instructions on how to make the PR ready for review Fixes: https://scylladb.atlassian.net/browse/RELENG-152 Closes scylladb/scylladb#27547	2025-12-10 14:53:38 +02:00
Nadav Har'El	8822c23ad4	Merge 'test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions thr…' from Dario Mirovic …eshold The initial problem: Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` exceptions happened. Test logic: These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests. The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values. Note that the exceptions are counted per shard, not per code path. The new problem: The occasional exceptions thrown by some parts of the server now occur a bit more often than before. Based on the logs linked on the issues, it is usually 12. There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted. Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths. For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. 
It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run). Still, to make the occurrence of "background exceptions" a bit more normalized, `run_count` too is doubled, from `100` to `200`. At first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold. Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again. Fixes #27247 Fixes #27325 Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions are requested. Closes scylladb/scylladb#27412 * github.com:scylladb/scylladb: test: cqlpy: test_protocol_exceptions.py: enable debug exception logging test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold	2025-12-10 10:53:30 +02:00
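The threshold argument in this merge can be written down as a short sketch. The values `run_count` (100, then 200) and `cpp_exception_threshold` (10, then 20), and the observed background count of 12, are taken from the message; the function names are hypothetical and the real test's bookkeeping is not shown here.

```python
def passes(exception_count: int, cpp_exception_threshold: int) -> bool:
    """The test condition: no more than the threshold of exceptions occurred."""
    return exception_count <= cpp_exception_threshold


def regression_floor(run_count: int) -> int:
    """A regressed code path throws at least once per run."""
    return run_count


OLD = {"run_count": 100, "cpp_exception_threshold": 10}
NEW = {"run_count": 200, "cpp_exception_threshold": 20}

background = 12  # typical unrelated exception count reported in the issues

old_result = passes(background, OLD["cpp_exception_threshold"])
new_result = passes(background, NEW["cpp_exception_threshold"])
still_detects = not passes(regression_floor(NEW["run_count"]),
                           NEW["cpp_exception_threshold"])
```

The gap between the threshold (20) and the regression floor (200) is what keeps the relaxed threshold from hiding a real regression.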
Marcin Maliszkiewicz	be9992cfb3	fix rjson::value to bytes conversion with missing GetStringLength call	2025-12-09 19:27:22 +01:00
Marcin Maliszkiewicz	daf00a7f24	alternator: change type from string to string_view in should_add_capacity It avoids allocation.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	62962f33bb	fix rjson::value to string_view conversion with missing GetStringLength call In some cases we unnecessarily convert to string, which causes a copy. In others we convert without calling GetStringLength, which causes an iteration to determine a length that is already known. In some cases we even do both. This commit fixes that.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	060c2f7c0d	use rjson::to_string_view when rjson::value gets converted using GetStringLength This commit is only cosmetics, changes calls to GetStringLength into rjson::to_string_view with the same underlying implementation.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	64149b57c3	use rjson::to_sstring and rjson::to_string for various string conversions In some cases we omit size checking, which is wrong because, according to the RapidJSON documentation, strings may contain a \0 byte in the middle.	2025-12-09 19:27:21 +01:00
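The hazard named in this commit, a string with a \0 byte in the middle being cut short when its length is not taken explicitly, can be shown outside of C++ with Python's `ctypes`. This is an analogy under that assumption, not the rjson code: `terminated_view` plays the role of a conversion that scans for the terminator, and `sized_view` the role of one that uses the known length.

```python
import ctypes

payload = b"ab\x00cd"  # five bytes with an embedded NUL
# The buffer gets one extra byte for the trailing terminator.
buf = ctypes.create_string_buffer(payload)


def terminated_view(buffer) -> bytes:
    """Read up to the first NUL, as a length-less C string conversion would."""
    return buffer.value


def sized_view(buffer, length: int) -> bytes:
    """Read exactly `length` bytes, as a length-aware conversion would."""
    return buffer.raw[:length]


truncated = terminated_view(buf)
complete = sized_view(buf, len(payload))
```

`truncated` keeps only the bytes before the embedded NUL, while `complete` preserves the whole payload.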
Marcin Maliszkiewicz	4b004fcdfc	utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds So that we can use our common utility functions.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	5e38b3071b	utils: move rjson::to_string_view func to string related place	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	225b3351fc	utils: add to_sstring and to_string rjson helper So that conversion code is common and it's easier to avoid accidental type conversions. Additionally, according to the RapidJSON library, size must be checked explicitly; this also avoids an extra iteration in char* to (s)string conversion.	2025-12-09 19:27:21 +01:00
Avi Kivity	80c6718ea8	build: update toolchain to Fedora 43 with clang 21.1.6 Rebase to Fedora 43 with clang 21.1 and libstdc++ 15. Fedora container image registry moved to registry.fedoraproject.org as it seems to be updated more regularly. Added python3-devel to the dependencies as some packages scylla-cqlsh depends on aren't yet available in the form of wheels for Python 3.14, and so have to be built locally. In any case it's better to reduce dependency on those wheels even if the ones currently missing appear eventually. Added libev-devel to the dependencies so that the python driver builds correctly even if "wheels" are not published. This reduces our dependency on the python driver's binary release schedule. Without libev-devel, TLS does not work correctly. We no longer remove the clang and clang-libs packages. Doxygen started depending on clang-libs, and removing them removes doxygen, breaking the build when it looks for that. The build will still pick up the optimized clang, since /usr/local/bin is earlier in the path. We keep the clang package, since it allows us to mess a little less with the directory structure. 
Optimized clang binaries are generated and stored in https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-x86_64.tar.gz With ./scripts/refresh-pgo-profiles.sh, the new compiler shows a small performance improvement (instructions_per_op) in perf-simple-query: clang 21: 259353.60 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17427 cycles/op, 0 errors) 265940.08 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35725 insns/op, 17042 cycles/op, 0 errors) 262650.01 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17240 cycles/op, 0 errors) 262881.22 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35675 insns/op, 17222 cycles/op, 0 errors) 264898.68 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35732 insns/op, 17070 cycles/op, 0 errors) throughput: mean= 263144.72 standard-deviation=2528.69 median= 262881.22 median-absolute-deviation=1753.96 maximum=265940.08 minimum=259353.60 instructions_per_op: mean= 35714.47 standard-deviation=22.34 median= 35720.38 median-absolute-deviation=10.20 maximum=35732.14 minimum=35675.50 cpu_cycles_per_op: mean= 17200.12 standard-deviation=154.62 median= 17221.70 median-absolute-deviation=129.77 maximum=17427.33 minimum=17041.57 clang 20: 254431.39 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17708 cycles/op, 0 errors) 259701.02 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17351 cycles/op, 0 errors) 261166.92 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35912 insns/op, 17270 cycles/op, 0 errors) 260656.31 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35869 insns/op, 17289 cycles/op, 0 errors) 259628.13 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35946 insns/op, 17370 cycles/op, 0 errors) throughput: mean= 259116.75 standard-deviation=2698.56 median= 259701.02 median-absolute-deviation=1539.55 maximum=261166.92 
minimum=254431.39 instructions_per_op: mean= 35898.42 standard-deviation=30.69 median= 35882.97 median-absolute-deviation=15.90 maximum=35945.63 minimum=35869.02 cpu_cycles_per_op: mean= 17397.49 standard-deviation=178.35 median= 17351.35 median-absolute-deviation=108.79 maximum=17707.63 minimum=17269.68 Closes scylladb/scylladb#26773	2025-12-09 15:16:31 +02:00
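The size of the "small performance improvement" can be derived from the figures quoted in this commit. The sketch below recomputes the throughput means and medians from the five per-run tps values of each compiler and compares the reported `instructions_per_op` means; the variable and function names are illustrative.

```python
from statistics import mean, median

# Per-run throughput (tps) copied from the commit message.
CLANG_21_TPS = [259353.60, 265940.08, 262650.01, 262881.22, 264898.68]
CLANG_20_TPS = [254431.39, 259701.02, 261166.92, 260656.31, 259628.13]

# Reported instructions_per_op means.
CLANG_21_INSNS = 35714.47
CLANG_20_INSNS = 35898.42


def relative_change(new: float, old: float) -> float:
    """Fractional change from `old` to `new` (positive means `new` is larger)."""
    return (new - old) / old


tps_gain = relative_change(mean(CLANG_21_TPS), mean(CLANG_20_TPS))
insns_change = relative_change(CLANG_21_INSNS, CLANG_20_INSNS)

print(f"throughput: {tps_gain:+.2%}, instructions_per_op: {insns_change:+.2%}")
```

The recomputed means and medians agree with the reported summary lines, and the comparison works out to roughly 1.5% higher mean throughput and roughly 0.5% fewer instructions per operation with clang 21.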
Pavel Emelyanov	855b91ec20	scripts: Make PR merging check more granular Currently we have 3 explicit checks, and some of them are configurable: - Jenkins job being stable. Can be disabled with --force - Whether a submodule update is happening. It's not allowed by default, and should be enabled with the --allow-submodule option - Target branch checking (recently merged #27249). Happens unconditionally This PR unifies all checks in two ways. First, each restriction can be lifted with --allow-foo options. The existing --allow-submodule stays and two options are added: - --allow-unstable to skip the Jenkins job check (like --force works now) - --allow-any-branch to skip the target branch check Second, the --force option lifts all the known restrictions. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27294	2025-12-09 13:58:21 +02:00
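The option scheme described in this commit can be sketched with `argparse`. The four flag names come from the commit message; everything else (the helper names, the dictionary keys, and the use of Python at all, since the real merge script may be written in another language) is an assumption made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="PR merge checks (sketch)")
    parser.add_argument("--force", action="store_true",
                        help="lift all known restrictions")
    parser.add_argument("--allow-submodule", action="store_true",
                        help="permit submodule updates")
    parser.add_argument("--allow-unstable", action="store_true",
                        help="skip the Jenkins job stability check")
    parser.add_argument("--allow-any-branch", action="store_true",
                        help="skip the target branch check")
    return parser


def lifted_restrictions(argv):
    """Map each restriction to whether the given options lift it."""
    args = build_parser().parse_args(argv)
    return {
        "submodule": args.force or args.allow_submodule,
        "unstable": args.force or args.allow_unstable,
        "any_branch": args.force or args.allow_any_branch,
    }
```

For example, `lifted_restrictions(["--allow-unstable"])` lifts only the Jenkins check, while `lifted_restrictions(["--force"])` lifts all three.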
Nadav Har'El	95e303faf3	Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack. In this series we remove no longer needed code and reorganize the needed code for better clarity. After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines. Fixes https://github.com/scylladb/scylladb/issues/26313 Closes scylladb/scylladb#27383 * github.com:scylladb/scylladb: mv: replace the simple/complex rack-aware pairing with exact rack matching mv: split out vnode pairing code from get_view_natural_endpoint mv: unify self-pairing and rack-aware pairing into one bool mv: remove the workaround for left nodes when sending view updates	2025-12-09 13:19:13 +02:00
Nadav Har'El	8ba595e472	Merge 'alternator: fix batch writes during intranode tablet migrations' from Petr Gusev Scylla implements `LWT` in the `storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false. The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such a broken invariant results in an `on_internal_error` in `storage_proxy::cas`. Clients of `storage_proxy::cas` are expected to check `cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation. This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem. 
Fixes: scylladb/scylladb#27353 backport: needs to be backported to 2025.4 Closes scylladb/scylladb#27396 * github.com:scylladb/scylladb: alternator/executor.cc: eliminate redundant dk copy alternator/executor.cc: release cas_shard on the original shard alternator/executor.cc: move shard check into cas_write alternator/executor.cc: make cas_write a private method alternator/executor.cc: make do_batch_write a private method alternator/executor.cc: fix indent test_alternator: add test_alternator_invalid_shard_for_lwt	2025-12-09 11:25:15 +02:00
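The "keep jumping until `this_shard()` is true" rule from this merge can be modelled with a small simulation. This is not the Scylla code: `CasShard`, `run_on_owning_shard`, and the `owner_at` callback are hypothetical stand-ins, and a plain loop replaces the cross-shard `smp::submit_to` call.

```python
class CasShard:
    """Stand-in for a cas_shard: records the owner seen when it was acquired."""

    def __init__(self, owner: int, current: int):
        self._owner = owner
        self._current = current

    def this_shard(self) -> bool:
        return self._owner == self._current

    def shard(self) -> int:
        return self._owner


def run_on_owning_shard(owner_at, start_shard: int, max_jumps: int = 16):
    """Re-acquire and re-check after every jump.

    owner_at(step) returns the owning shard observed at acquisition `step`,
    which models the tablet advancing through migration between jumps.
    Returns (shard the operation finally runs on, number of jumps taken).
    """
    current = start_shard
    for step in range(max_jumps):
        cas_shard = CasShard(owner_at(step), current)
        if cas_shard.this_shard():
            return current, step
        current = cas_shard.shard()  # models jumping to cas_shard.shard()
    raise RuntimeError("tablet kept moving; giving up")


# The owner is shard 1 at the first check, then moves to shard 2.
observed = [1, 2, 2]
result = run_on_owning_shard(lambda step: observed[min(step, 2)], start_shard=0)
```

A single check would have stopped on shard 1, where the owner no longer matches; the loop continues to shard 2, which is the case the fix covers.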
Petr Gusev	608eee0357	alternator/executor.cc: eliminate redundant dk copy A small refactoring/optimization.	2025-12-09 10:21:06 +01:00
Petr Gusev	0bcc2977bb	alternator/executor.cc: release cas_shard on the original shard Before this series, we kept the cas_shard on the original shard to guard against tablet movements running in parallel with storage_proxy::cas. The bug addressed by this PR shows that this approach is flawed: keeping the cas_shard on the original shard does not guarantee that a new cas_shard acquired on the target shard won’t require another jump. We fixed this in the previous commit by checking cas_shard.this_shard() on the target shard and continuing to jump to another shard if necessary. Once cas_shard.this_shard() on the target shard returns true, the storage_proxy::cas invariants are satisfied, and no other cas_shard instances need to remain alive except the one passed into storage_proxy::cas.	2025-12-09 10:21:06 +01:00
Petr Gusev	3a865fe991	alternator/executor.cc: move shard check into cas_write This change ensures that if cas_shard points to a different shard, the executor will continue issuing shard jumps until cas_shard.this_shard() returns true. The commit simply moves the this_shard() check from the parallel_for_each lambda into cas_write, with minimal functional changes. We enable test_alternator_invalid_shard_for_lwt since now it should pass. Fixes scylladb/scylladb#27353	2025-12-09 10:21:01 +01:00
Pavel Emelyanov	fb32e1c7ee	Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky Refactor how we decide whether an sstable belongs to a tablet, fully or partially, to simplify the flow and make it more readable. Also extract the logic and make it testable, and add tests to cover the changes. The change is purely aesthetic, no need to backport. Closes scylladb/scylladb#27101 * github.com:scylladb/scylladb: streaming: remove unnecessary lambda creating sstable token range streaming: simplify get_sstables_for_tablets logic streaming: switch to range-based for loop streaming: drop sstable skip microoptimization in tablet loop streaming: replace reverse iterators with reverse view in sstables scan streaming: return from get_sstables_for_tablets earlier streaming: add get_sstables_by_tablet_range tests test,sstables: add helper to set sstable first and last keys streaming: refactor get_sstables_for_tablets to make it accessible streaming: refactor get_sstables_for_tablets to make it testable streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic	2025-12-09 10:53:57 +03:00
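The "fully or partially" decision mentioned in the merge above could look roughly like the following Python sketch. It assumes integer tokens and closed intervals for both the sstable key span and the tablet range; the real code works with Scylla's own token range types, and `Overlap` and `classify` are hypothetical names.

```python
from enum import Enum


class Overlap(Enum):
    NONE = "none"        # sstable has no tokens in the tablet
    PARTIAL = "partial"  # sstable straddles a tablet boundary
    FULL = "full"        # sstable lies entirely inside the tablet


def classify(sst_first: int, sst_last: int, tablet_start: int, tablet_end: int) -> Overlap:
    """Classify a closed sstable token span against a closed tablet range."""
    if sst_last < tablet_start or sst_first > tablet_end:
        return Overlap.NONE
    if tablet_start <= sst_first and sst_last <= tablet_end:
        return Overlap.FULL
    return Overlap.PARTIAL


print(classify(10, 20, 0, 100))   # Overlap.FULL
print(classify(90, 120, 0, 100))  # Overlap.PARTIAL
print(classify(150, 160, 0, 100)) # Overlap.NONE
```

Keeping the classification in a small pure function is what makes it straightforward to unit test, which matches the stated goal of the refactoring.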
Patryk Jędrzejczak	b6895f0fa7	test: make test_broken_bootstrap faster This change makes the test ~20 s faster. It's a forgotten follow-up: https://github.com/scylladb/scylladb/pull/18927#discussion_r1627331946 Closes scylladb/scylladb#27445	2025-12-09 09:25:42 +02:00
Dario Mirovic	c30b326033	test: cqlpy: test_protocol_exceptions.py: enable debug exception logging Enable debug logging for the "exception" logger inside protocol exception tests. The exceptions will be logged, and it will be possible to see which ones occurred if a protocol exceptions test fails. Refs #27272 Refs #27325	2025-12-09 01:35:42 +01:00
Dario Mirovic	807fc68dc5	test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold The initial problem: Some of the tests in test_protocol_exceptions.py started failing. The failing condition is that no more than `cpp_exception_threshold` exceptions happened. Test logic: These tests assert that specific code paths do not throw an exception anymore. The initial implementation ran a code path once and asserted there were 0 exceptions. Sometimes one or several exceptions can occur that are not directly related to the code paths the tests check, and those would fail the tests. The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values. Note that the exceptions are counted per shard, not per code path. The new problem: Some parts of the server now occasionally throw a few more exceptions than before. Based on the logs linked on the issues, it is usually 12. There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted. Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths. For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run). 
Still, to make the occurrence of "background exceptions" a bit more normalized, `run_count` is also doubled, from `100` to `200`. At first glance this looks like nothing has changed, but doubling both the run count and the exception threshold implies that the exception burst does not scale as much as the run count; it is just that the "jitter" is bigger than the old threshold. Fixes #27247 Fixes #27325	2025-12-09 01:34:48 +01:00
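The threshold argument in the commit message above can be checked with a few lines of Python. This is an illustrative sketch only: `verdict` is a hypothetical helper, and the background count of 12 is taken from the message and assumed to stay near that value after `run_count` is doubled.

```python
def verdict(exceptions_seen: int, cpp_exception_threshold: int) -> str:
    """The test passes when no more than the threshold of exceptions happened."""
    return "pass" if exceptions_seen <= cpp_exception_threshold else "fail"


background = 12  # typical unrelated exception count quoted in the message

# Old settings: background noise alone exceeds the threshold, so tests flake.
old = verdict(background, cpp_exception_threshold=10)

# New settings: the same noise fits under the raised threshold.
new = verdict(background, cpp_exception_threshold=20)

# A real regression throws at least one exception per run, so with
# run_count = 200 it produces at least 200 exceptions and is still caught.
run_count = 200
regression = verdict(run_count * 1 + background, cpp_exception_threshold=20)

print(old, new, regression)
```

The gap between the threshold (20) and the minimum regression count (200) is what keeps the larger threshold from hiding a real regression.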
Michał Jadwiszczak	51843195f7	test/boost/view_build_test: increase number of retries The default number of retries in `eventually()` in `test_builder_with_concurrent_drop` is sometimes not enough to observe changes in system tables on aarch64 builds. This patch increases the number of retries to 30. Fixes scylladb/scylladb#27370 Closes scylladb/scylladb#27493	2025-12-08 23:14:01 +02:00
Gleb Natapov	7038b8b544	test/scylla_cluster: fix the check that a process failed to start If the process is running, returncode will be None; otherwise it will have some value (which can be 0 as well), and the current code treats 0 as if the process is still running. Closes scylladb/scylladb#27490	2025-12-08 18:23:29 +01:00
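The pitfall described in the commit above can be reproduced with Python's standard `subprocess` module. This is a sketch, not the code from `test/scylla_cluster`: the helpers `looks_exited_buggy` and `has_exited` are hypothetical names, and the child processes are trivial Python one-liners.

```python
import subprocess
import sys


def looks_exited_buggy(returncode) -> bool:
    # Truthiness check: an exit code of 0 is falsy, so a process that has
    # already exited with 0 is mistaken for one that is still running.
    return bool(returncode)


def has_exited(returncode) -> bool:
    # Correct check: only None means "still running".
    return returncode is not None


# A child that exits immediately with status 0.
done = subprocess.Popen([sys.executable, "-c", "pass"])
done.wait()

# A child that keeps running while we inspect it.
running = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
running_code = running.poll()  # None while the child is alive

print(looks_exited_buggy(done.returncode), has_exited(done.returncode))
print(has_exited(running_code))

# Clean up the long-running child.
running.kill()
running.wait()
```

Comparing against `None` explicitly is the usual idiom here, since `0`, the most common successful exit code, is falsy in Python.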
Tomasz Grabiec	7df610b73d	sstables: Remove host id mismatch warning for sstable streaming Tablet migration transfers sstable files without changing the origin host-id, as it should, because those sstables were not written on the destination host and should be ignored by commit log replay. So it's a normal situation, and it's confusing to see this warning in logs. Fixes #26957 Closes scylladb/scylladb#27433	2025-12-08 18:39:22 +02:00
Piotr Dulikowski	386309d6a0	Merge 'Improve the way distributed-loader constructs storage_options for backup sstables' from Pavel Emelyanov The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally needs the storage type for that endpoint (s3 or gcs). This is needed to construct the storage_options object used to create an sstable object. To get the type, the method scans the db::config option, but there's a much simpler way to get it. Code cleanup, no need to backport Closes scylladb/scylladb#27381 * github.com:scylladb/scylladb: sstables_loader: Provide endpoint type for get_sstables_from_object_store() storage_manager: Introduce get_endpoint_type() method storage_manager: Split get_endpoint_client()	2025-12-08 16:55:20 +01:00
Amnon Heiman	a213e41250	scylla-node-exporter: Add ethtool to node exporter AWS recommends monitoring several network performance metrics: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html#network-performance-metrics This patch enables the ethtool collector with the specific list of metrics. After this patch the relevant metrics look like: $ curl http://localhost:9100/metrics |& grep ethtool node_ethtool_bw_in_allowance_exceeded{device="ens5"} 0 node_ethtool_bw_out_allowance_exceeded{device="ens5"} 0 node_ethtool_conntrack_allowance_available{device="ens5"} 51303 node_ethtool_conntrack_allowance_exceeded{device="ens5"} 0 node_ethtool_info{bus_info="0000:00:05.0",device="ens5",driver="ena",expansion_rom_version="",firmware_version="",version="6.14.0-1015-aws"} 1 node_ethtool_linklocal_allowance_exceeded{device="ens5"} 0 node_scrape_collector_duration_seconds{collector="ethtool"} 0.001091436 node_scrape_collector_success{collector="ethtool"} 1 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#27358	2025-12-08 14:27:10 +02:00
Dawid Mędrek	58dc414912	test/cluster/mv: Rewrite test_view_building_scheduling_group We rewrite the test to avoid flakiness. Instead of looking at the metrics, we make a trade-off and start depending on a less reliable mechanism -- logs. We grep all relevant messages printed by Scylla in TRACE mode and make sure that they were all printed from a context using the streaming scheduling group. Although it's a "less proper" way of testing, it should be much more dependable and avoid flakiness. Fixes scylladb/scylladb#25957 Closes scylladb/scylladb#26656	2025-12-08 14:24:25 +02:00
Ferenc Szili	d883ff2317	test: fix flakiness caused by TRUNCATE retries The test test_truncate_during_topology_change tests TRUNCATE TABLE while bootstrapping a new node. With tablets enabled, TRUNCATE is a global topology operation which needs to serialize with bootstrap. When TRUNCATE TABLE is issued, it first checks if there is an already queued truncate for the same table. This can happen if a previous TRUNCATE operation has timed out and the client retried. The newly issued truncate will only join the queued one if it is waiting to be processed, and will fail immediately if the TRUNCATE is already being processed. In this test, TRUNCATE will be retried after a timeout (1 minute) due to the default retry policy, and will be retried up to 3 times, while the bootstrap is delayed by 2 minutes. This means that the test can validate the result of a truncate which was started after bootstrap was completed. Because of the way truncate joins existing truncate operations, we can also have the following scenario: - TRUNCATE times out after one minute because the new node is being bootstrapped - the client retries the TRUNCATE command, which also times out after 1m - the third attempt is received while TRUNCATE is being processed, which fails the test This patch changes the retry policy of the TRUNCATE operation to FallthroughRetryPolicy, which guarantees that TRUNCATE will not be retried on timeout. It also increases the timeout of the TRUNCATE from 1 to 4 minutes. This way the test will actually validate the result of the TRUNCATE operation which was issued during bootstrap, instead of the subsequent, retried TRUNCATEs which could have been issued after the bootstrap was complete. Fixes: #26347 Closes scylladb/scylladb#27245	2025-12-08 14:13:26 +02:00
dependabot[bot]	1f777da863	build(deps): bump sphinx-scylladb-theme from 1.8.9 to 1.8.10 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.9 to 1.8.10. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/commits) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.10 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27468	2025-12-08 13:40:51 +02:00
Asias He	faad0167d7	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user can use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e., start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#26924	2025-12-08 13:35:19 +02:00
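The completion rule in the commit above ("finished ranges cover the original tablet token ranges") amounts to an interval coverage check. A minimal Python sketch follows, assuming integer tokens and half-open, non-wrapping ranges `[start, end)`; Scylla's actual token ranges and bookkeeping differ, and `covers` is a hypothetical name.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end), an assumption of this sketch


def covers(finished: List[Range], original: Range) -> bool:
    """Return True when the union of finished ranges covers the original range."""
    start, end = original
    reached = start  # everything in [start, reached) is known to be repaired
    for lo, hi in sorted(finished):
        if lo > reached:
            break  # gap: tokens in [reached, lo) were never repaired
        reached = max(reached, hi)
        if reached >= end:
            return True
    return reached >= end


# The tablet [0, 100) was split into two tablets while the request was running.
original = (0, 100)
print(covers([(0, 50)], original))             # only one half repaired so far
print(covers([(50, 100), (0, 50)], original))  # both halves repaired
```

Because the check works on token ranges rather than tablet identities, it gives the same answer whether the original tablet was repaired whole or as several post-split pieces.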
Andrei Chekun	0115a21b9a	test.py: fail test when timeout reached for boost test There is a bug in the current pytest boost implementation. When the timeout is reached, the process is killed, but this was not correctly propagated, which led to a false positive result. This change fails the test case when the timeout for the process is reached. This is to prevent issues like https://github.com/scylladb/scylladb/issues/27237 Closes scylladb/scylladb#27463	2025-12-08 11:49:46 +01:00
Ernest Zaslavsky	71834ce7dd	streaming: remove unnecessary lambda creating sstable token range The `sstable_token_range` lambda was only used once to create a token range for an SSTable. Inline the construction directly where needed, removing the extra lambda. This simplifies the code without changing behavior.	2025-12-08 12:30:24 +02:00

50899 Commits