scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 14:15:46 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	afe786843f	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 (cherry picked from commit `2bb800c004`)	2025-07-31 15:12:50 +00:00
Tomasz Grabiec	24435185d7	Merge '[Backport 2025.1] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot] This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. Even if we didn't deadlock, and the streaming semaphore was simply exhausted by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes #24807 Fixes #24925 - (cherry picked from commit `ee2fa58bd6`) - (cherry picked from commit `dff2b01237`) Parent PR: #24929 Closes scylladb/scylladb#25052 * github.com:scylladb/scylladb: streaming: Avoid deadlock by running view checks in a separate scheduling group service: migration_manager: Run group0 barrier in gossip scheduling group	2025-07-30 02:22:51 +02:00
Tomasz Grabiec	2a9eecdb65	streaming: Avoid deadlock by running view checks in a separate scheduling group This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes: #24807 (cherry picked from commit `dff2b01237`)	2025-07-28 21:47:50 +02:00
Aleksandra Martyniuk	23479939b9	tasks: do not use binary progress for task manager tasks Currently, progress of a parent task depends on expected_total_workload, expected_children_number, and children progresses. Basically, if total workload is known or all children have already been created, progresses of children are summed up. Otherwise binary progress is returned. As a result, two tasks of the same type may return progress in different units. If they are children of the same task and this parent gathers the progress - it becomes meaningless. Drop expected_children_number as we can't assume that children are able to show their progresses. Modify get_progress method - progress is calculated based on children progresses. If expected_total_workload isn't specified, the total progress of a task may grow. If expected_total_workload isn't specified and no children are created, empty progress (0/0) is returned. Fixes: https://github.com/scylladb/scylladb/issues/24650. Closes scylladb/scylladb#25113 (cherry picked from commit `a7ee2bbbd8`) Closes scylladb/scylladb#25197	2025-07-28 09:23:57 +03:00
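The aggregation rule described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the task manager's actual code: the `progress` struct and `aggregate_progress` function are hypothetical names, and the sketch assumes that a specified expected total workload simply replaces the summed total.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical progress representation: work completed out of a total.
struct progress {
    double completed = 0;
    double total = 0;
};

// Sums the progress of all children. With no children and no expected
// total, the result is the empty progress 0/0.
// Assumption: a known expected total workload overrides the summed total.
progress aggregate_progress(const std::vector<progress>& children,
                            std::optional<double> expected_total_workload = std::nullopt) {
    progress result;
    for (const auto& child : children) {
        assert(child.completed <= child.total);
        result.completed += child.completed;
        result.total += child.total;
    }
    if (expected_total_workload) {
        result.total = *expected_total_workload;
    }
    return result;
}
```

Without an expected total, the reported total grows as more children are created, which matches the note in the commit message that the total progress of a task may grow.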
Aleksandra Martyniuk	3f93cdc61b	nodetool: repair: skip tablet keyspaces Currently, nodetool repair command repairs both vnode and tablet keyspaces if no keyspace is specified. We should use this command to repair only vnode keyspaces, but this isn't easily accessible - we have to explicitly run repair only on vnode keyspaces. nodetool repair skips tablet keyspaces unless a tablet keyspace is explicitly passed as an argument. Fixes: #24040. Closes scylladb/scylladb#24042 (cherry picked from commit `6f8b378e80`) Closes scylladb/scylladb#25152	2025-07-24 16:32:37 +03:00
Piotr Dulikowski	8298405805	auth: fix crash when migration code runs parallel with raft upgrade The functions password_authenticator::start and standard_role_manager::start have a similar structure: they spawn a fiber which invokes a callback that performs some migration until that migration succeeds. Both handlers set a shared promise called _superuser_created_promise (those are actually two promises, one for the password authenticator and the other for the role manager). The handlers are similar in both cases. They check if auth is in legacy mode, and behave differently depending on that. If in legacy mode, the promise is set (if it was not set before), and some legacy migration actions follow. In auth-on-raft mode, the superuser is attempted to be created, and if it succeeds then the promise is _unconditionally_ set. While it makes sense at a glance to set the promise unconditionally, there is a non-obvious corner case during upgrade to topology on raft. During the upgrade, auth switches from the legacy mode to auth on raft mode. Thus, if the callback didn't succeed in legacy mode and then tries to run in auth-on-raft mode and succeeds, it will unconditionally set a promise that was already set - this is a bug and triggers an assertion in seastar. Fix the issue by surrounding the `shared_promise::set_value` call with an `if` - like it is already done for the legacy case. Backport note: the bugfix part for password_authenticator was removed from the commit because 2025.1 does not have scylladb/scylladb#22532 and thus does not contain the bug. Fixes: scylladb/scylladb#24975 Closes scylladb/scylladb#24976 (cherry picked from commit `a14b7f71fe`) Closes scylladb/scylladb#25017	2025-07-24 09:16:18 +02:00
Calle Wilund	e53d5ed0dc	utils::http::dns_connection_factory: Use a shared certificate_credentials Fixes #24447 This factory type, which is really more a data holder/connection producer per connection instance, creates, if using https, a new certificate_credentials on every instance. Which when used by S3 client is per client and scheduling groups. Which eventually means that we will do a set_system_trust + "cold" handshake for every tls connection created this way. This will cause both IO and cold/expensive certificate checking -> possible stalls/wasted CPU. Since the credentials object in question is literally a "just trust system", it could very well be shared across the shard. This PR adds a thread local static cached credentials object and uses this instead. Could consider moving this to seastar, but maybe this is too much. Closes scylladb/scylladb#24448 (cherry picked from commit `80feb8b676`) Closes scylladb/scylladb#24460	2025-07-22 11:02:59 +03:00
Michael Litvak	a57c51b9d7	tablets: stop storage group on deallocation When a tablet transitions to a post-cleanup stage on the leaving replica we deallocate its storage group. Before the storage can be deallocated and destroyed, we must make sure it's cleaned up and stopped properly. Normally this happens during the tablet cleanup stage, when table::cleanup_table is called, so by the time we transition to the next stage the storage group is already stopped. However, it's possible that tablet cleanup did not run in some scenario: 1. The topology coordinator runs tablet cleanup on the leaving replica. 2. The leaving replica is restarted. 3. When the leaving replica starts, still in `cleanup` stage, it allocates a storage group for the tablet. 4. The topology coordinator moves to the next stage. 5. The leaving replica deallocates the storage group, but it was not stopped. To address this scenario, we always stop the storage group when deallocating it. Usually it will be already stopped and complete immediately, and otherwise it will be stopped in the background. Fixes scylladb/scylladb#24857 Fixes scylladb/scylladb#24828 Closes scylladb/scylladb#24896 (cherry picked from commit `fa24fd7cc3`) Closes scylladb/scylladb#24906	2025-07-22 11:02:28 +03:00
Pavel Emelyanov	57b24383ed	Merge '[Backport 2025.1] Improve background disposal of tablet_metadata' from Scylladb[bot] As seen in #23284, when the tablet_metadata contains many tables, even empty ones, we're seeing a long queue of seastar tasks coming from the individual destruction of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`. This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects on their owner shard by sorting them into vectors, per- owner shard. Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the contained tablet_metadata would be cleared gently. Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom and verify that it is gone with this change. Fixes #24814 Refs #23284 This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables. 
* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards - (cherry picked from commit `3acca0aa63`) - (cherry picked from commit `493a2303da`) - (cherry picked from commit `e0a19b981a`) - (cherry picked from commit `2b2cfaba6e`) - (cherry picked from commit `2c0bafb934`) - (cherry picked from commit `4a3d14a031`) - (cherry picked from commit `6e4803a750`) Parent PR: #24618 Closes scylladb/scylladb#24862 * github.com:scylladb/scylladb: token_metadata_impl: clear_gently: release version tracker early test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables token_metadata: clear_and_destroy_impl when destroyed token_metadata: keep a reference to shared_token_metadata token_metadata: move make_token_metadata_ptr into shared_token_metadata class replica: database: get and expose a mutable locator::shared_token_metadata locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction	2025-07-22 11:02:00 +03:00
Aleksandra Martyniuk	5f07f9963d	replica: hold compaction group gate during flush Destructor of database_sstable_write_monitor, which is created in table::try_flush_memtable_to_sstable, tries to get the compaction state of the processed compaction group. If at this point the compaction group is already stopped (and the compaction state is removed), e.g. due to concurrent tablet merge, an exception is thrown and a node coredumps. Add flush gate to compaction group to wait for flushes in compaction_group::stop. Hold the gate in seal function in table::make_memtable_list. seal function is turned into a coroutine to ensure it won't throw. Wait until async_gate is closed before flushing, to ensure that all data is written into sstables. Stop ongoing compactions beforehand. Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber. Stop method already flushes the compaction group. Fixes: #23911. Closes scylladb/scylladb#24582 (cherry picked from commit `2ec54d4f1a`) Closes scylladb/scylladb#24950	2025-07-22 11:01:37 +03:00
Patryk Jędrzejczak	f11413274d	test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when: - both writes succeeded with the same replica responding first, - one of the following reads succeeded with the other replica responding before it applied mutations from any of the writes. We fix the test by not expecting reads with CL=ONE to return a row. We also harden the test by inserting different rows for every pair (CL, coordinator), where one of the two coordinators is a normal node from DC1, and the other one is a zero-token node from DC2. This change makes sure that, for example, every write really inserts a row. Fixes scylladb/scylladb#22967 The fix addresses CI flakiness and only changes the test, so it should be backported. Closes scylladb/scylladb#23518 (cherry picked from commit `21edec1ace`) Fixing conflicts required additionally backporting the log line from scylladb/scylladb#22968. Closes scylladb/scylladb#24983	2025-07-22 11:01:18 +03:00
Pavel Emelyanov	fb20a59242	Merge '[Backport 2025.1] cdc: throw error if column doesn't exist' from Scylladb[bot] In the CDC log transformer, when creating a CDC mutation based on some base table mutation, for each value of a base column we set the value in the CDC column with the same name. When looking up the column in the CDC schema by name, we may get a null pointer if a column by that name is not found. This shouldn't happen normally because the base schema and CDC schema should be compatible, and for each base column there should be a CDC column with the same name. However, there are scenarios where the base schema and CDC schema are incompatible for a short period of time when they are being altered. When a base column is being added or dropped, we could get a base mutation with this column set, and then the CDC transformer picks up the latest CDC schema which doesn't have this column. If such a thing happens, we fix the code to throw an exception instead of crashing on null pointer dereference. Currently we don't have a safer approach to handle this, but this might be changed in the future. The other alternative is dropping that data silently which we prefer not to do. Throwing an error is acceptable because this scenario most likely indicates this behavior by the user: * The user adds a new column, and starts writing values to the column before the ALTER is complete. or, * The user drops a column, and continues writing values to the column while it's being dropped. Both cases might as well fail with an error because the column is not found in the base table. Fixes https://github.com/scylladb/scylladb/issues/24952 backport needed - simple fix for a node crash - (cherry picked from commit `b336f282ae`) - (cherry picked from commit `86dfa6324f`) Parent PR: #24986 Closes scylladb/scylladb#25065 * github.com:scylladb/scylladb: test: cdc: add test_cdc_with_alter cdc: throw error if column doesn't exist	2025-07-22 11:00:32 +03:00
Pavel Emelyanov	7cb673b6c2	Merge '[Backport 2025.1] cdc: Forbid altering columns of CDC log tables directly' from Scylladb[bot] The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 Backport: we should backport the change to all affected branches to prevent the consequences that may affect the user. - (cherry picked from commit `20d0050f4e`) - (cherry picked from commit `59800b1d66`) Parent PR: #25008 Closes scylladb/scylladb#25106 * github.com:scylladb/scylladb: cdc: Forbid altering columns of inactive CDC log table cdc: Forbid altering columns of CDC log tables directly	2025-07-22 11:00:18 +03:00
Dawid Mędrek	739dbf0f7d	cdc: Forbid altering columns of inactive CDC log table When CDC becomes disabled on the base table, the CDC log table still exists (cf. scylladb/scylladb@adda43edc7). If it continues to exist up to the point when CDC is re-enabled on the base table, no new log table will be created -- instead, the old log table will be re-attached. Since we want to avoid situations when the definition of the log table has become misaligned with the definition of the base table due to actions of the user, we forbid modifying the set of columns or renaming them in CDC log tables, even when they're inactive. Validation tests are provided. (cherry picked from commit `59800b1d66`)	2025-07-21 11:42:38 +00:00
Dawid Mędrek	02108573b6	cdc: Forbid altering columns of CDC log tables directly The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 (cherry picked from commit `20d0050f4e`)	2025-07-21 11:42:38 +00:00
Benny Halevy	5592d66d32	token_metadata_impl: clear_gently: release version tracker early No need to wait for all members to be cleared gently. We can release the version earlier since the held version may be awaited for in barriers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `6e4803a750`)	2025-07-21 10:23:33 +03:00
Benny Halevy	7414cf1327	test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables Reproduces #23284 Currently skipped in release mode since it requires the `short_tablet_stats_refresh_interval` interval. Ref #24641 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `4a3d14a031`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-21 10:23:31 +03:00
Benny Halevy	d70ed984cb	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. Since it's a reference-counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2c0bafb934`)	2025-07-21 10:02:19 +03:00
Benny Halevy	e24bfb0bf3	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clean and destroy the token_data_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2b2cfaba6e`)	2025-07-21 10:00:49 +03:00
Benny Halevy	aa8e65616e	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `e0a19b981a`)	2025-07-21 10:00:49 +03:00
Benny Halevy	a3025520d2	replica: database: get and expose a mutable locator::shared_token_metadata Prepare for next patch, that will use this shared_token_metadata to make mutable_token_metadata_ptr:s Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `493a2303da`)	2025-07-21 09:59:17 +03:00
Benny Halevy	2d55d417b9	locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction Sort all tablet_map_ptr:s by shard_id and then destroy them on each shard to prevent long cross-shard task queues for foreign_ptr destructions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `3acca0aa63`)	2025-07-21 09:51:08 +03:00
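The per-shard grouping step described in the commit above can be sketched without Seastar as follows. This only illustrates the bucketing idea; `owned_item` and `group_by_owner_shard` are hypothetical names, and the real code works on `tablet_map_ptr` objects and dispatches each batch to its owner shard.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for an object that must be destroyed on its owner shard.
struct owned_item {
    unsigned owner_shard;
    int id;
};

// Buckets items into one vector per owner shard, so that each shard could be
// handed its whole batch in a single cross-shard call instead of one task per item.
std::vector<std::vector<owned_item>>
group_by_owner_shard(std::vector<owned_item> items, unsigned shard_count) {
    assert(shard_count > 0);
    std::vector<std::vector<owned_item>> per_shard(shard_count);
    for (auto& item : items) {
        // at() throws std::out_of_range if an item names a nonexistent shard.
        per_shard.at(item.owner_shard).push_back(std::move(item));
    }
    return per_shard;
}
```

Batching this way replaces many small cross-shard tasks with at most one per shard, which is the effect the commit message attributes to the change.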
Michael Litvak	e54414ab28	test: cdc: add test_cdc_with_alter Add a test that tests adding and dropping a column to a table with CDC enabled while writing to it. (cherry picked from commit `86dfa6324f`)	2025-07-20 10:20:24 +02:00
Michael Litvak	9e76902da1	cdc: throw error if column doesn't exist In the CDC log transformer, when creating a CDC mutation based on some base table mutation, for each value of a base column we set the value in the CDC column with the same name. When looking up the column in the CDC schema by name, we may get a null pointer if a column by that name is not found. This shouldn't happen normally because the base schema and CDC schema should be compatible, and for each base column there should be a CDC column with the same name. However, there are scenarios where the base schema and CDC schema are incompatible for a short period of time when they are being altered. When a base column is being added or dropped, we could get a base mutation with this column set, and then the CDC transformer picks up the latest CDC schema which doesn't have this column. If such a thing happens, we fix the code to throw an exception instead of crashing on null pointer dereference. Currently we don't have a safer approach to handle this, but this might be changed in the future. The other alternative is dropping that data silently which we prefer not to do. Throwing an error is acceptable because this scenario most likely indicates this behavior by the user: * The user adds a new column, and starts writing values to the column before the ALTER is complete. or, * The user drops a column, and continues writing values to the column while it's being dropped. Both cases might as well fail with an error because the column is not found in the base table. Fixes scylladb/scylladb#24952 (cherry picked from commit `b336f282ae`)	2025-07-18 10:35:29 +00:00
Tomasz Grabiec	434ecdee0e	service: migration_manager: Run group0 barrier in gossip scheduling group Fixes two issues. One is potential priority inversion. The barrier will be executed using scheduling group of the first fiber which triggers it, the rest will block waiting on it. For example, CQL statements which need to sync the schema on replica side can block on the barrier triggered by streaming. That's undesirable. This is theoretical, not proved in the field. The second problem is blocking the error path. This barrier is called from the streaming error handling path. If the streaming concurrency semaphore is exhausted, and streaming fails due to timeout on obtaining the permit in check_needs_view_update_path(), the error path will block too because it will also attempt to obtain the permit as part of the group0 barrier. Running it in the gossip scheduling group prevents this. Fixes #24925 (cherry picked from commit `ee2fa58bd6`)	2025-07-17 17:24:33 +00:00
Jenkins Promoter	74b1e40502	Update ScyllaDB version to: 2025.1.6	2025-07-17 15:58:23 +03:00
Avi Kivity	36d2f80f38	Merge '[Backport 2025.1] storage_service: Use utils::chunked_vector to avoid big allocation' from Scylladb[bot] The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, 
std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 - (cherry picked from commit `c5a136c3b5`) Parent PR: #24561 Closes scylladb/scylladb#24890 * github.com:scylladb/scylladb: storage_service: Use utils::chunked_vector to avoid big allocation utils: chunked_vector: implement erase() for single elements and ranges utils: chunked_vector: implement insert() for single-element inserts	2025-07-16 17:28:36 +03:00
Asias He	478d02ce83	storage_service: Use utils::chunked_vector to avoid big allocation The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> 
>::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 Closes scylladb/scylladb#24561 (cherry picked from commit `c5a136c3b5`)	2025-07-16 15:39:51 +08:00
Piotr Dulikowski	451cf275bf	Merge '[Backport 2025.1] Fix for cassandra role gets recreated after DROP ROLE' from Scylladb[bot] This patchset fixes regression introduced by `7e749cd848` when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user. Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm it with CL QUORUM and only then atomically create role or password. If server is started without cluster quorum we'll skip creating role or password. Fixes https://github.com/scylladb/scylladb/issues/24469 Backport: all versions since 2024.2 - (cherry picked from commit `68fc4c6d61`) - (cherry picked from commit `c96c5bfef5`) - (cherry picked from commit `2e2ba84e94`) - (cherry picked from commit `f85d73d405`) - (cherry picked from commit `d9ec746c6d`) - (cherry picked from commit `a3bb679f49`) - (cherry picked from commit `67a4bfc152`) - (cherry picked from commit `0ffddce636`) - (cherry picked from commit `5e7ac34822`) Parent PR: #24451 Closes scylladb/scylladb#24693 * github.com:scylladb/scylladb: test: auth_cluster: add test for password reset procedure auth: cache roles table scan during startup test: auth_cluster: add test for replacing default superuser test: pylib: add ability to specify default authenticator during server_start test: pylib: allow rolling restart without waiting for cql auth: split auth-v2 logic for adding default superuser password auth: split auth-v2 logic for adding default superuser role auth: ldap: fix waiting for underlying role manager auth: wait for default role creation before starting authorizer and authenticator	2025-07-16 08:08:49 +02:00
Avi Kivity	60135c8d6c	utils: chunked_vector: implement erase() for single elements and ranges Implement using std::rotate() and resize(). The elements to be erased are rotated to the end, then resized out of existence. Again we defer optimization for trivially copyable types. Unit tests are added. Needed for range_streamer with token_ranges using chunked_vector. (cherry picked from commit `d6eefce145`)	2025-07-16 07:38:55 +08:00
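The rotate-then-resize approach described in the commit above can be sketched as follows. This is a minimal illustration on `std::vector`, not ScyllaDB's `chunked_vector` code; the helper name `erase_by_rotate` is hypothetical and only assumes a container with random-access iterators, `size()` and `resize()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the ScyllaDB implementation): erase the range
// [first, last) using only std::rotate() and resize().
template <typename Container>
typename Container::iterator
erase_by_rotate(Container& c,
                typename Container::iterator first,
                typename Container::iterator last) {
    assert(first <= last);
    const auto pos = first - c.begin();   // index of the first erased element
    const auto count = last - first;      // number of elements to erase
    // Rotate the doomed elements to the tail; the survivors keep their order.
    std::rotate(first, last, c.end());
    // "Resize them out of existence": shrinking destroys the tail elements.
    c.resize(c.size() - static_cast<std::size_t>(count));
    // Iterator to the element that followed the erased range.
    return c.begin() + pos;
}
```

As the commit notes for the real implementation, this trades speed for simplicity: every element after the erased range is moved by `std::rotate()`. With `std::vector`, `resize()` also requires the element type to be default-constructible, even when shrinking.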
Avi Kivity	efad391fac	utils: chunked_vector: implement insert() for single-element inserts partition_range_compat's unwrap() needs insert if we are to use it for chunked_vector (which we do). Implement using push_back() and std::rotate(). emplace(iterator, args) is also implemented, though the benefit is diluted (it will be moved after construction). The implementation isn't optimal - if T is trivially copyable then using std::memmove() will be much faster than std::rotate(), but this complex optimization is left for later. Unit tests are added. (cherry picked from commit `5301f3d0b5`)	2025-07-16 07:38:55 +08:00
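A single-element insert built from `push_back()` and `std::rotate()`, as the commit above describes, might look like the sketch below. It uses `std::vector` as a stand-in for `chunked_vector`, and the helper name `insert_by_rotate` is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not the ScyllaDB implementation): insert `value`
// before `pos` using only push_back() and std::rotate().
template <typename Container>
typename Container::iterator
insert_by_rotate(Container& c,
                 typename Container::iterator pos,
                 typename Container::value_type value) {
    assert(pos >= c.begin() && pos <= c.end());
    // Remember the index: push_back() may reallocate and invalidate `pos`.
    const auto idx = pos - c.begin();
    c.push_back(std::move(value));
    auto where = c.begin() + idx;
    // Rotate the new last element into position; elements in
    // [where, end - 1) shift one slot to the right.
    std::rotate(where, c.end() - 1, c.end());
    return where;
}
```

The value is constructed at the back and then moved into place, which is why the commit says the benefit of `emplace()` is diluted under this scheme.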
Botond Dénes	9258d0e1cf	test/cluster/test_read_repair: write 100 rows in trace test This test asserts that a read repair really happened. To ensure this happens it writes a single partition after enabling the database_apply error injection point. For some reason, the write is sometimes reordered with the error injection and the write will get replicated to both nodes and no read repair will happen, failing the test. To make the test less sensitive to such rare reordering, add a clustering column to the table and write 100 rows. The chance of all 100 of them being reordered with the error injection should be low enough that it doesn't happen again (famous last words). Fixes: #24330 Closes scylladb/scylladb#24403 (cherry picked from commit `495f607e73`) Closes scylladb/scylladb#24972	2025-07-15 20:16:52 +03:00
Yaron Kaikov	83baf94bea	dist/common/scripts/scylla_sysconfig_setup: fix `SyntaxWarning: invalid escape sequence` There are invalid escape sequence warnings where raw strings should be used for the regex patterns Fixes: https://github.com/scylladb/scylladb/issues/24915 Closes scylladb/scylladb#24916 (cherry picked from commit `fdcaa9a7e7`) Closes scylladb/scylladb#24966	2025-07-15 11:09:16 +02:00
Yaron Kaikov	a20e56230b	auto-backport.py: Avoid bot push to existing backport branches Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork. If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open. Fixes: https://github.com/scylladb/scylladb/issues/24953 Closes scylladb/scylladb#24954 (cherry picked from commit `ed7c7784e4`) Closes scylladb/scylladb#24965	2025-07-15 10:54:55 +02:00
Aleksandra Martyniuk	23bcff5c52	repair: Reduce max row buf size when small table optimization is on If small_table_optimization is on, a repair works on a whole table simultaneously. It may be distributed across the whole cluster and all nodes might participate in repair. On a repair master, row buffer is copied for each repair peer. This means that the memory scales with the number of peers. In large clusters, repair with small_table_optimization leads to OOM. Divide the max_row_buf_size by the number of repair peers if small_table_optimization is on. Use max_row_buf_size to calculate number of units taken from mem_sem. Fixes: https://github.com/scylladb/scylladb/issues/22244. Closes scylladb/scylladb#24868 (cherry picked from commit `17272c2f3b`) Closes scylladb/scylladb#24904 scylla-2025.1.5 scylla-2025.1.5-candidate-20250715054333	2025-07-15 09:38:16 +03:00
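The sizing rule from the commit above (divide `max_row_buf_size` by the number of repair peers when `small_table_optimization` is on) can be sketched as below. The function name, its signature, and the floor of one byte are assumptions for illustration; they are not taken from the ScyllaDB repair code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sketch: per-peer row buffer budget on the repair master.
// With small_table_optimization the row buffer is copied once per peer,
// so the budget is divided by the peer count to keep total memory bounded.
std::size_t effective_max_row_buf_size(std::size_t max_row_buf_size,
                                       std::size_t peer_count,
                                       bool small_table_optimization) {
    assert(max_row_buf_size > 0);
    if (!small_table_optimization || peer_count == 0) {
        return max_row_buf_size;
    }
    // Assumed floor of 1 byte so the result is never zero.
    return std::max<std::size_t>(1, max_row_buf_size / peer_count);
}
```

For example, a 32 MiB budget shared across 8 peers yields 4 MiB per peer, so the total held on the repair master stays near the original limit instead of growing with cluster size.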
Jenkins Promoter	8c860ad264	Update pgo profiles - aarch64	2025-07-15 04:37:19 +03:00
Jenkins Promoter	e24f80aef7	Update pgo profiles - x86_64	2025-07-15 04:36:48 +03:00
Raphael S. Carvalho	c307d84925	test: fix flakiness of test_missing_data Only 2025.1 is susceptible. The merge logic is slightly different in master, so the test had to be adjusted for 2025.1, but the adjusted test is flaky: two successive merges can cause the wait for the merge to never finish. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes scylladb/scylladb#24821 Closes scylladb/scylladb#24936	2025-07-14 14:28:05 +03:00
Gleb Natapov	ce492fa5d0	api: unregister raft_topology_get_cmd_status on shutdown In `c8ce9d1c60` we introduced raft_topology_get_cmd_status REST api but the commit forgot to unregister the handler during shutdown. Fixes #24910 Closes scylladb/scylladb#24911 (cherry picked from commit `89f2edf308`) Closes scylladb/scylladb#24921	2025-07-14 11:44:44 +02:00
Lakshmi Narayanan Sreethar	46dfe09e64	db/corrupt_data_handler: guard stop() against null _fragment_semaphore The `system_table_corrupt_data_handler::_fragment_semaphore` member is initialized only when the `system_keyspace` sharded service is initialized by `main`. If the server shuts down before that due to an unrelated reason, `_fragment_semaphore` remains default-initialized to `nullptr`. When the shutdown process later attempts to call `stop()` on `system_table_corrupt_data_handler`, it tries to call `stop()` on `_fragment_semaphore`, leading to a segfault. Fix this by checking if `_fragment_semaphore` is null before invoking `stop()` on it. Although `corrupt_data_handler` was backported to 2025.1, this issue does not occur in 2025.2 and master. The recent versions include #23113, which changes how the system keyspace is stopped and PR #24492, which originally introduced `corrupt_data_handler`, builds on that change to ensure `_fragment_semaphore` is stopped only if it has been created. Fixes #24920 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#24931	2025-07-14 12:06:29 +03:00
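The null guard described in the commit above can be illustrated with a simplified, synchronous sketch. The real code is asynchronous Seastar code; here `fake_semaphore`, `corrupt_data_handler_sketch`, and the `bool` return value are stand-ins invented for the example, and only the member name `_fragment_semaphore` comes from the commit message.

```cpp
#include <cassert>
#include <memory>

// Stand-in for the real semaphore type; records whether stop() ran.
struct fake_semaphore {
    bool stopped = false;
    void stop() { stopped = true; }
};

// Hypothetical, simplified handler: the semaphore is created late
// (in init()), so stop() must tolerate it never having been created.
class corrupt_data_handler_sketch {
    std::unique_ptr<fake_semaphore> _fragment_semaphore; // nullptr until init()
public:
    void init() {
        assert(!_fragment_semaphore);
        _fragment_semaphore = std::make_unique<fake_semaphore>();
    }
    // Returns true if a semaphore existed and was stopped.
    bool stop() {
        if (!_fragment_semaphore) {
            return false; // shut down before init(): nothing to stop
        }
        _fragment_semaphore->stop();
        return true;
    }
};
```

Without the check, calling `stop()` on a handler that was never initialized would dereference a null pointer, which is the segfault the commit describes.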
Avi Kivity	477cad246f	tools: optimized_clang: make it work in the presence of a scylladb profile optimized_clang.sh trains the compiler using profile-guided optimization (pgo). However, while doing that, it builds scylladb using its own profile stored in pgo/profiles and decompressed into build/profile.profdata. Due to the funky directory structure used for training the compiler, that path is invalid during the training and the build fails. The workaround was to build on a cloud machine instead of a workstation - this worked because the cloud machine didn't have git-lfs installed, and therefore did not see the stored profile, and the whole mess was averted. To make this work on a machine that does have access to stored profiles, disable use of the stored profile even if it exists. Fixes #22713 Closes scylladb/scylladb#24571 (cherry picked from commit `52f11e140f`) Closes scylladb/scylladb#24620	2025-07-13 21:43:49 +03:00
Ferenc Szili	3e3147c03a	logging: Add row count to large partition warning message When writing large partitions, that is: partitions with size or row count above a configurable threshold, ScyllaDB outputs a warning to the log: WARN ... large_data - Writing large partition test/test: (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db This warning contains the information about the size of the partition, but it does not contain the number of rows written. This can lead to confusion because in cases where the warning was written because of the row count being larger than the threshold, but the partition size is below the threshold, the warning will only contain the partition size in bytes, leading the user to believe the warning was output because of the partition size, when in reality it was the row count that triggered the warning. See #20125 This change adds a size_desc argument to cql_table_large_data_handler::try_record(), which will contain the description of the size of the object written. This method is used to output warnings for large partitions, row counts, row sizes and cell sizes. This change does not modify the warning message for row and cell sizes, only for partition size and row count. The warning for large partitions and row counts will now look like this: WARN ... large_data - Writing large partition test/test: (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db Closes scylladb/scylladb#22010 (cherry picked from commit `96267960f8`) Closes scylladb/scylladb#24681	2025-07-10 16:20:23 +02:00
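The new warning text from the commit above can be reproduced with a small formatting sketch. The two function names below are hypothetical; only the idea of passing a separate size description and the resulting message layout come from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical: build the size description for a large partition,
// carrying both the byte size and the row count.
std::string partition_size_desc(std::uint64_t bytes, std::uint64_t rows) {
    return std::to_string(bytes) + " bytes/" + std::to_string(rows) + " rows";
}

// Hypothetical: assemble the body of the warning (without the log prefix).
std::string large_partition_warning(const std::string& key,
                                    const std::string& size_desc,
                                    const std::string& sstable) {
    assert(!sstable.empty());
    return "Writing large partition " + key + ": (" + size_desc + ") to " + sstable;
}
```

Keeping the size description as a separate string mirrors the `size_desc` argument mentioned in the commit: the caller decides what "size" means for each kind of large object, and the message layout stays the same.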
Marcin Maliszkiewicz	0bbc701863	test: auth_cluster: add test for password reset procedure (cherry picked from commit `aef531077b`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	52dfc23f9e	auth: cache roles table scan during startup It may be particularly beneficial during connection storms on startup. In such cases, it can happen that none of the user's read requests succeed, preventing the cache from being populated. This, in turn, makes it more difficult for subsequent reads to succeed, reducing resiliency against such storms. (cherry picked from commit `887c57098e`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	b91eee6103	test: auth_cluster: add test for replacing default superuser This test demonstrates creating custom superuser guide: https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html (cherry picked from commit `d9223b61a2`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	a10e17106b	test: pylib: add ability to specify default authenticator during server_start Sometimes we may not want to use default cassandra role for control connection, especially when we test dropping default role. (cherry picked from commit 08bf7237f066cead133bf0cac9bba215f238070a)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	b79415a072	test: pylib: allow rolling restart without waiting for cql Waiting for CQL requires default superuser being present in db. In some cases we may delete it and still want to do rolling restart. Additionally if we need CQL we may want to wait after restart is complete (once, and not for each node). (cherry picked from commit `d9ec746c6d`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	0665f9a12b	auth: split auth-v2 logic for adding default superuser password In raft mode (auth-v2) we need to do atomic write after read as we give stricter consistency guarantees. Instead of patching legacy logic this commit adds different path as: - old code may be less tested now so it's best to not change it - new code path avoids quorum selects in a typical flow (passwords set) There may be a case when user deletes a superuser or password right before restarting a node, in such case we may omit updating a password but: - this is a trade-off between quorum reads on startup - it's far more important to not update password when it shouldn't be - if needed password will be updated on next node restart If there is no quorum on startup we'll skip creating password because we can't perform any raft operation. Additionally this fixes a problem when password is created despite having non default superuser in auth-v2. (cherry picked from commit `f85d73d405`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	b916f51df1	auth: split auth-v2 logic for adding default superuser role In raft mode (auth-v2) we need to do atomic write after read as we give stricter consistency guarantees. Instead of patching legacy logic this commit adds different path as: - old code may be less tested now so it's best to not change it - new code path avoids quorum selects in a typical flow (roles set) This fixes a problem when superuser role is created despite having non default superuser in auth-v2. If there is no quorum on startup we'll skip creating role because we can't perform any raft operation. (cherry picked from commit `2e2ba84e94`)	2025-07-10 11:19:09 +02:00
Marcin Maliszkiewicz	f5bf479ae7	auth: ldap: fix waiting for underlying role manager ldap_role_manager depends on standard_role_manager, therefore it needs to wait for superuser initialization. If this is missing, the password authenticator will start checking the default password too early and may fail to create the default password if there is no default role yet. Currently password authenticator will create password together with the role in such case but in following commits we want to separate those responsibilities correctly. (cherry picked from commit `c96c5bfef5`)	2025-07-10 11:19:09 +02:00

46897 Commits