Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now consult a new method of the service level
accessor, `can_use_effective_service_level_cache`, which checks the auth
version.
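As a rough illustration of the gating described above, the following
standalone C++ sketch models a cache that stays unused until both
versions report raft. The types `version`, `accessor_sketch` and
`sl_controller_sketch` are hypothetical stand-ins, not the actual
ScyllaDB classes; only the method name
`can_use_effective_service_level_cache` is taken from this commit.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for the migration state of a subsystem.
enum class version { legacy, raft };

// Hypothetical accessor; the real one is described as checking the auth version.
struct accessor_sketch {
    version auth = version::legacy;
    bool can_use_effective_service_level_cache() const {
        return auth == version::raft;
    }
};

class sl_controller_sketch {
    version _service_levels = version::legacy;
    accessor_sketch _accessor;
    std::map<std::string, int> _effective_cache; // role -> shares (illustrative)
public:
    void set_versions(version sl, version auth) {
        _service_levels = sl;
        _accessor.auth = auth;
    }
    bool cache_usable() const {
        return _service_levels == version::raft
            && _accessor.can_use_effective_service_level_cache();
    }
    // Population is skipped until both migrations are done, so the reload
    // path never has to query roles from the legacy auth tables.
    bool update_cache(const std::map<std::string, int>& roles) {
        if (!cache_usable()) {
            return false;
        }
        _effective_cache = roles;
        return true;
    }
    // The getter applies the same check and reports "no value" otherwise.
    std::optional<int> find_effective(const std::string& role) const {
        if (!cache_usable()) {
            return std::nullopt;
        }
        auto it = _effective_cache.find(role);
        if (it == _effective_cache.end()) {
            return std::nullopt;
        }
        return it->second;
    }
};
```

In this model, a reload that lands between the two migrations simply
leaves the cache untouched instead of issuing a query that can time out.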
Fixes: scylladb/scylladb#24963
(cherry picked from commit 2bb800c004)
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.
The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B
Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.
Even if we didn't deadlock, and the streaming semaphore was simply
exhausted by other receiving sessions (via tracking-only permit), the
query may still time out due to starvation.
To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.
Fixes #24807
Fixes #24925
- (cherry picked from commit ee2fa58bd6)
- (cherry picked from commit dff2b01237)
Parent PR: #24929
Closes scylladb/scylladb#25052
* github.com:scylladb/scylladb:
streaming: Avoid deadlock by running view checks in a separate scheduling group
service: migration_manager: Run group0 barrier in gossip scheduling group
This issue happens with removenode, when RBNO is disabled, so range
streamer is used.
The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B
Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.
To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.
Fixes: #24807
(cherry picked from commit dff2b01237)
Currently, progress of a parent task depends on expected_total_workload,
expected_children_number, and children progresses. Basically, if total
workload is known or all children have already been created, progresses
of children are summed up. Otherwise binary progress is returned.
As a result, two tasks of the same type may return progress in different
units. If they are children of the same task and this parent gathers the
progress - it becomes meaningless.
Drop expected_children_number as we can't assume that children are able
to show their progresses.
Modify get_progress method - progress is calculated based on children
progresses. If expected_total_workload isn't specified, the total
progress of a task may grow. If expected_total_workload isn't specified
and no children are created, empty progress (0/0) is returned.
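The new calculation can be sketched as below. This is a simplified
model under stated assumptions, not the task manager's actual code:
`task_progress` and the free function `get_progress` are hypothetical,
and the sketch assumes that a known `expected_total_workload` replaces
the summed total of the children.

```cpp
#include <optional>
#include <vector>

// Hypothetical progress pair: completed units out of total units.
struct task_progress {
    double completed = 0;
    double total = 0;
};

// Sum the children's progress. If the expected total workload is known,
// report it as the total; otherwise the total is whatever the children
// have reported so far, so it may grow as more children are created.
inline task_progress get_progress(const std::vector<task_progress>& children,
                                  std::optional<double> expected_total_workload) {
    task_progress result; // stays 0/0 when there are no children
    for (const auto& child : children) {
        result.completed += child.completed;
        result.total += child.total;
    }
    if (expected_total_workload) {
        result.total = *expected_total_workload;
    }
    return result;
}
```

With no children and no expected workload the function returns the
empty progress 0/0, matching the description above.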
Fixes: https://github.com/scylladb/scylladb/issues/24650.
Closes scylladb/scylladb#25113
(cherry picked from commit a7ee2bbbd8)
Closes scylladb/scylladb#25197
Currently, nodetool repair command repairs both vnode and tablet keyspaces
if no keyspace is specified. We should use this command to repair
only vnode keyspaces, but this isn't easily accessible - we have to
explicitly run repair only on vnode keyspaces.
With this change, nodetool repair skips tablet keyspaces unless a tablet
keyspace is explicitly passed as an argument.
Fixes: #24040.
Closes scylladb/scylladb#24042
(cherry picked from commit 6f8b378e80)
Closes scylladb/scylladb#25152
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).
The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, an attempt is made to create the
superuser, and if it succeeds the promise is _unconditionally_ set.
While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.
Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.
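The failure mode and the guard can be modelled with a small standalone
sketch. `once_promise_sketch` is a hypothetical stand-in that throws
where the real promise would hit an assertion, and the three step
functions are illustrative names for the legacy and auth-on-raft paths.

```cpp
#include <stdexcept>

// Hypothetical model of a promise that may be resolved at most once.
class once_promise_sketch {
    bool _available = false;
public:
    bool available() const { return _available; }
    void set_value() {
        if (_available) {
            // The real code would trip an assertion here.
            throw std::logic_error("promise already set");
        }
        _available = true;
    }
};

// Legacy path: resolve the promise only if it was not resolved before.
inline void legacy_step(once_promise_sketch& p) {
    if (!p.available()) {
        p.set_value();
    }
}

// Auth-on-raft path before the fix: resolves unconditionally.
inline void raft_step_unguarded(once_promise_sketch& p) {
    p.set_value();
}

// Auth-on-raft path after the fix: same guard as the legacy path.
inline void raft_step_guarded(once_promise_sketch& p) {
    if (!p.available()) {
        p.set_value();
    }
}
```

Calling `legacy_step` followed by `raft_step_unguarded` on the same
promise reproduces the double set; the guarded variant is a no-op in
that case.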
Backport note: the bugfix part for password_authenticator was removed
from the commit because 2025.1 does not have scylladb/scylladb#22532 and
thus does not contain the bug.
Fixes: scylladb/scylladb#24975
Closes scylladb/scylladb#24976
(cherry picked from commit a14b7f71fe)
Closes scylladb/scylladb#25017
Fixes #24447
This factory type, which is really more of a data holder/connection
producer per connection instance, creates a new certificate_credentials
on every instance when using https. When used by the S3 client, there is
one instance per client and scheduling group.
This eventually means that we do a set_system_trust + "cold" handshake
for every tls connection created this way.
This will cause both IO and cold/expensive certificate checking -> possible
stalls/wasted CPU. Since the credentials object in question is literally a
"just trust system", it could very well be shared across the shard.
This PR adds a thread local static cached credentials object and uses this
instead. Could consider moving this to seastar, but maybe this is too much.
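The caching idea can be illustrated with plain C++ `thread_local`
storage. This is only a sketch: `credentials_sketch`,
`build_system_trust_credentials` and `cached_system_trust_credentials`
are hypothetical names, and the load counter exists purely to make the
effect observable.

```cpp
#include <memory>

// Hypothetical stand-in for a "just trust system" credentials object.
struct credentials_sketch {};

// Counts how often the expensive system trust load ran on this thread.
inline int& system_trust_load_count() {
    thread_local int count = 0;
    return count;
}

// Models the expensive part: loading the system trust store.
inline std::shared_ptr<credentials_sketch> build_system_trust_credentials() {
    ++system_trust_load_count();
    return std::make_shared<credentials_sketch>();
}

// One cached object per thread (per shard in a thread-per-core design).
// The initializer runs only on the first call made by each thread.
inline std::shared_ptr<credentials_sketch> cached_system_trust_credentials() {
    thread_local std::shared_ptr<credentials_sketch> cached =
        build_system_trust_credentials();
    return cached;
}
```

Every connection factory on the same thread then receives the same
object, so the trust store is loaded once rather than per connection.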
Closes scylladb/scylladb#24448
(cherry picked from commit 80feb8b676)
Closes scylladb/scylladb#24460
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.
Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.
However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
stopped.
To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.
Fixes scylladb/scylladb#24857
Fixes scylladb/scylladb#24828
Closes scylladb/scylladb#24896
(cherry picked from commit fa24fd7cc3)
Closes scylladb/scylladb#24906
As seen in #23284, when the tablet_metadata contains many tables, even empty ones,
we're seeing a long queue of seastar tasks coming from the individual destruction of
`tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`.
This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects
on their owner shard by sorting them into per-owner-shard vectors.
Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed
arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the
contained tablet_metadata would be cleared gently.
Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom
and verify that it is gone with this change.
Fixes #24814
Refs #23284
This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables.
* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards
- (cherry picked from commit 3acca0aa63)
- (cherry picked from commit 493a2303da)
- (cherry picked from commit e0a19b981a)
- (cherry picked from commit 2b2cfaba6e)
- (cherry picked from commit 2c0bafb934)
- (cherry picked from commit 4a3d14a031)
- (cherry picked from commit 6e4803a750)
Parent PR: #24618
Closes scylladb/scylladb#24862
* github.com:scylladb/scylladb:
token_metadata_impl: clear_gently: release version tracker early
test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables
token_metadata: clear_and_destroy_impl when destroyed
token_metadata: keep a reference to shared_token_metadata
token_metadata: move make_token_metadata_ptr into shared_token_metadata class
replica: database: get and expose a mutable locator::shared_token_metadata
locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
Destructor of database_sstable_write_monitor, which is created
in table::try_flush_memtable_to_sstable, tries to get the compaction
state of the processed compaction group. If at this point
the compaction group is already stopped (and the compaction state
is removed), e.g. due to concurrent tablet merge, an exception is
thrown and a node coredumps.
Add flush gate to compaction group to wait for flushes in
compaction_group::stop. Hold the gate in seal function in
table::make_memtable_list. seal function is turned into
a coroutine to ensure it won't throw.
Wait until async_gate is closed before flushing, to ensure that
all data is written into sstables. Stop ongoing compactions
beforehand.
Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber.
Stop method already flushes the compaction group.
Fixes: #23911.
Closes scylladb/scylladb#24582
(cherry picked from commit 2ec54d4f1a)
Closes scylladb/scylladb#24950
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
responding before it applied mutations from any of the writes.
We fix the test by not expecting reads with CL=ONE to return a row.
We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.
Fixes scylladb/scylladb#22967
The fix addresses CI flakiness and only changes the test, so it
should be backported.
Closes scylladb/scylladb#23518
(cherry picked from commit 21edec1ace)
Fixing conflicts required additionally backporting the log line
from scylladb/scylladb#22968.
Closes scylladb/scylladb#24983
In the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.
When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.
However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.
If this happens, the code now throws an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.
Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and starts writing values to the column
before the ALTER is complete, or
* The user drops a column, and continues writing values to the column
while it's being dropped.
Both cases might as well fail with an error because the column is not
found in the base table.
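The shape of the fix can be sketched as follows. The schema is modelled
as a plain name-to-column map; `column_def_sketch`, `schema_sketch` and
`get_cdc_column` are hypothetical names, and the exception type and
message are assumptions made for illustration.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for a column definition and a CDC log schema.
struct column_def_sketch {
    int id;
};
using schema_sketch = std::unordered_map<std::string, column_def_sketch>;

// Look up the CDC column matching a base column name. A missing column
// is reported as an error instead of being dereferenced as a null pointer.
inline const column_def_sketch& get_cdc_column(const schema_sketch& cdc_schema,
                                               const std::string& name) {
    auto it = cdc_schema.find(name);
    if (it == cdc_schema.end()) {
        throw std::runtime_error(
            "CDC log schema has no column named '" + name + "'");
    }
    return it->second;
}
```

A write that races with ALTER then fails with an error the caller can
handle, rather than crashing the node.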
Fixes https://github.com/scylladb/scylladb/issues/24952
backport needed - simple fix for a node crash
- (cherry picked from commit b336f282ae)
- (cherry picked from commit 86dfa6324f)
Parent PR: #24986
Closes scylladb/scylladb#25065
* github.com:scylladb/scylladb:
test: cdc: add test_cdc_with_alter
cdc: throw error if column doesn't exist
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.
In this commit, we're restricting those operations. We also provide two
validation tests.
One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.
Fixes scylladb/scylladb#24643
Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.
- (cherry picked from commit 20d0050f4e)
- (cherry picked from commit 59800b1d66)
Parent PR: #25008
Closes scylladb/scylladb#25106
* github.com:scylladb/scylladb:
cdc: Forbid altering columns of inactive CDC log table
cdc: Forbid altering columns of CDC log tables directly
When CDC becomes disabled on the base table, the CDC log table
still exists (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old log table will be *re-attached*.
Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.
Validation tests are provided.
(cherry picked from commit 59800b1d66)
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.
In this commit, we're restricting those operations. We also provide two
validation tests.
One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.
Fixes scylladb/scylladb#24643
(cherry picked from commit 20d0050f4e)
No need to wait for all members to be cleared gently.
We can release the version earlier since the
held version may be awaited for in barriers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6e4803a750)
Reproduces #23284
Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4a3d14a031)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
Since it's a reference-counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.
This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.
Fixes #13381
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c0bafb934)
To be used by a following patch to gently clean and destroy
the token_metadata_impl in the background.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2b2cfaba6e)
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e0a19b981a)
Prepare for the next patch, which will use this shared_token_metadata
to make mutable_token_metadata_ptr:s
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 493a2303da)
Sort all tablet_map_ptr:s by shard_id
and then destroy them on each shard to prevent
long cross-shard task queues for foreign_ptr destructions.
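The grouping step can be sketched in standalone C++. The sketch only
shows the bucketing; actually destroying each bucket on its owner shard
needs the cross-shard machinery, which is omitted. `foreign_item_sketch`
and `group_by_owner_shard` are hypothetical names.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for a pointer owned by a particular shard.
struct foreign_item_sketch {
    unsigned owner_shard;
    int payload;
};

// Bucket the items by owner shard so that each shard can later receive a
// single task that destroys its whole bucket, instead of one cross-shard
// task per item. at() throws std::out_of_range for an invalid shard id.
inline std::vector<std::vector<foreign_item_sketch>>
group_by_owner_shard(std::vector<foreign_item_sketch> items, unsigned shard_count) {
    std::vector<std::vector<foreign_item_sketch>> per_shard(shard_count);
    for (auto& item : items) {
        per_shard.at(item.owner_shard).push_back(std::move(item));
    }
    return per_shard;
}
```

The number of cross-shard tasks then scales with the number of shards
rather than with the number of tables.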
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3acca0aa63)
In the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.
When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.
However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.
If this happens, the code now throws an exception instead of
crashing on null pointer dereference. Currently we don't have a safer
approach to handle this, but this might be changed in the future. The
other alternative is dropping that data silently which we prefer not to
do.
Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and starts writing values to the column
before the ALTER is complete, or
* The user drops a column, and continues writing values to the column
while it's being dropped.
Both cases might as well fail with an error because the column is not
found in the base table.
Fixes scylladb/scylladb#24952
(cherry picked from commit b336f282ae)
Fixes two issues.
One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.
The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.
Fixes #24925
(cherry picked from commit ee2fa58bd6)
The following was seen:
```
!WARNING | scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
(inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
(inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
(inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
(inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```
Fix by using chunked_vector.
Fixes #24158
- (cherry picked from commit c5a136c3b5)
Parent PR: #24561
Closes scylladb/scylladb#24890
* github.com:scylladb/scylladb:
storage_service: Use utils::chunked_vector to avoid big allocation
utils: chunked_vector: implement erase() for single elements and ranges
utils: chunked_vector: implement insert() for single-element inserts
The following was seen:
```
!WARNING | scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
(inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
(inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
(inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
(inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```
Fix by using chunked_vector.
Fixes #24158
Closes scylladb/scylladb#24561
(cherry picked from commit c5a136c3b5)
This patchset fixes a regression introduced by 7e749cd848, where we started
re-creating the default superuser role and password from the config even if
a new custom superuser had been created by the user.
Now we first check with CL LOCAL_ONE whether the default superuser role or
password needs to be created, confirm it with CL QUORUM, and only then
atomically create the role or password.
If the server is started without a cluster quorum, we skip creating the role
or password.
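The decision sequence can be sketched as below. Everything in this
block is a hypothetical model: `superuser_env_sketch` bundles callbacks
that stand in for the real queries, and the order of the quorum
availability check relative to the two reads is an assumption.

```cpp
#include <functional>

// Consistency levels used by the two checks.
enum class consistency { local_one, quorum };

// Hypothetical environment: callbacks stand in for the real queries.
struct superuser_env_sketch {
    // Returns true if no usable superuser is visible at the given CL.
    std::function<bool(consistency)> needs_default_superuser;
    // Returns true if a quorum of the cluster is reachable.
    std::function<bool()> has_quorum;
    // Creates the default role or password (atomically in the real code).
    std::function<void()> create_default_superuser;
};

// Cheap local check first, then confirm with QUORUM, and only then create.
// Returns true if the default superuser was created.
inline bool maybe_create_default_superuser(const superuser_env_sketch& env) {
    if (!env.needs_default_superuser(consistency::local_one)) {
        return false; // a superuser is already visible locally
    }
    if (!env.has_quorum()) {
        return false; // cannot confirm, so skip creation
    }
    if (!env.needs_default_superuser(consistency::quorum)) {
        return false; // another replica already knows a superuser
    }
    env.create_default_superuser();
    return true;
}
```

The QUORUM confirmation is what keeps a stale local view from
re-creating the default superuser after a custom one was set up.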
Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2
- (cherry picked from commit 68fc4c6d61)
- (cherry picked from commit c96c5bfef5)
- (cherry picked from commit 2e2ba84e94)
- (cherry picked from commit f85d73d405)
- (cherry picked from commit d9ec746c6d)
- (cherry picked from commit a3bb679f49)
- (cherry picked from commit 67a4bfc152)
- (cherry picked from commit 0ffddce636)
- (cherry picked from commit 5e7ac34822)
Parent PR: #24451
Closes scylladb/scylladb#24693
* github.com:scylladb/scylladb:
test: auth_cluster: add test for password reset procedure
auth: cache roles table scan during startup
test: auth_cluster: add test for replacing default superuser
test: pylib: add ability to specify default authenticator during server_start
test: pylib: allow rolling restart without waiting for cql
auth: split auth-v2 logic for adding default superuser password
auth: split auth-v2 logic for adding default superuser role
auth: ldap: fix waiting for underlying role manager
auth: wait for default role creation before starting authorizer and authenticator
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.
Again we defer optimization for trivially copyable types.
Unit tests are added.
Needed for range_streamer with token_ranges using chunked_vector.
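The rotate-then-resize technique can be shown on any random-access
container. The sketch below uses index arguments and is demonstrated
with `std::vector`; `erase_by_rotate` is a hypothetical free function,
not the chunked_vector member itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Erase the elements in [first, last) by rotating them to the end of the
// container and then shrinking it. Works for any container that offers
// random-access iterators, size() and resize().
template <typename Container>
void erase_by_rotate(Container& c, std::size_t first, std::size_t last) {
    assert(first <= last && last <= c.size());
    using diff_t = typename Container::difference_type;
    auto b = c.begin();
    // After the rotate, the tail [last, end) starts at index `first` and
    // the doomed elements occupy the final (last - first) slots.
    std::rotate(b + static_cast<diff_t>(first), b + static_cast<diff_t>(last), c.end());
    c.resize(c.size() - (last - first));
}
```

For example, erasing indices 1 and 2 from `{0, 1, 2, 3, 4, 5}` first
rotates to `{0, 3, 4, 5, 1, 2}` and then resizes to `{0, 3, 4, 5}`.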
(cherry picked from commit d6eefce145)
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).
Implement using push_back() and std::rotate().
emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).
The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster than std::rotate(),
but this complex optimization is left for later.
Unit tests are added.
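The push_back-plus-rotate approach can be sketched generically.
`insert_by_rotate` is a hypothetical free function taking an index and
is demonstrated with `std::vector`; it is not the chunked_vector member
itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Insert `value` before index `pos` by appending it and then rotating the
// suffix [pos, end) right by one, which moves the new element into place.
template <typename Container, typename T>
void insert_by_rotate(Container& c, std::size_t pos, T value) {
    assert(pos <= c.size());
    using diff_t = typename Container::difference_type;
    c.push_back(std::move(value));
    // The new element sits at end() - 1; making it the first element of
    // the rotated range puts it at index `pos`.
    std::rotate(c.begin() + static_cast<diff_t>(pos), c.end() - 1, c.end());
}
```

Inserting 3 at index 2 of `{1, 2, 4}` appends to get `{1, 2, 4, 3}` and
rotates the last two elements to give `{1, 2, 3, 4}`.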
(cherry picked from commit 5301f3d0b5)
This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).
Fixes: #24330
Closes scylladb/scylladb#24403
(cherry picked from commit 495f607e73)
Closes scylladb/scylladb#24972
If small_table_optimization is on, a repair works on a whole table
simultaneously. It may be distributed across the whole cluster and
all nodes might participate in repair.
On a repair master, row buffer is copied for each repair peer.
This means that the memory scales with the number of peers.
In large clusters, repair with small_table_optimization leads to OOM.
Divide the max_row_buf_size by the number of repair peers if
small_table_optimization is on.
Use max_row_buf_size to calculate number of units taken from mem_sem.
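The scaling rule amounts to a single division. In the sketch below,
`effective_max_row_buf_size` is a hypothetical helper, and both the
guard against zero peers and the floor of one byte are assumptions added
to keep the example well defined.

```cpp
#include <algorithm>
#include <cstddef>

// Per-peer row buffer limit. With small_table_optimization the repair
// master keeps one buffer copy per peer, so the limit is divided by the
// peer count to keep the total roughly constant.
inline std::size_t effective_max_row_buf_size(std::size_t max_row_buf_size,
                                              std::size_t peers,
                                              bool small_table_optimization) {
    if (!small_table_optimization || peers == 0) {
        return max_row_buf_size;
    }
    return std::max<std::size_t>(1, max_row_buf_size / peers);
}
```

For a 32 MiB limit and 8 peers, each buffer is capped at 4 MiB, so the
memory taken from the semaphore no longer grows with cluster size.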
Fixes: https://github.com/scylladb/scylladb/issues/22244.
Closes scylladb/scylladb#24868
(cherry picked from commit 17272c2f3b)
Closes scylladb/scylladb#24904
Only 2025.1 is susceptible. Merge has slightly different logic in
master; the test had to be adjusted for 2025.1 but is flaky.
It can happen that two successive merges cause the wait for the merge
to never finish.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes scylladb/scylladb#24821
Closes scylladb/scylladb#24936
The `system_table_corrupt_data_handler::_fragment_semaphore` member is
initialized only when the `system_keyspace` sharded service is
initialized by `main`. If the server shuts down before that due to an
unrelated reason, `_fragment_semaphore` remains default-initialized to
`nullptr`. When the shutdown process later attempts to call `stop()` on
`system_table_corrupt_data_handler`, it tries to call `stop()` on
`_fragment_semaphore`, leading to a segfault.
Fix this by checking if `_fragment_semaphore` is null before invoking
`stop()` on it.
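The shape of the fix can be sketched as below. This is a simplified stand-in, not the real class: `fake_semaphore` and `handler_sketch` are hypothetical types, the real `stop()` is asynchronous, and the `bool` return value exists only so that the behaviour is observable in the sketch.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the semaphore owned by the handler.
struct fake_semaphore {
    bool stopped = false;
    void stop() { stopped = true; }
};

// Hypothetical, synchronous stand-in for system_table_corrupt_data_handler.
struct handler_sketch {
    // Stays nullptr until init() runs, mirroring a server that shuts down
    // before the system keyspace service is initialized.
    std::unique_ptr<fake_semaphore> _fragment_semaphore;

    void init() { _fragment_semaphore = std::make_unique<fake_semaphore>(); }

    // Returns true if the semaphore existed and was stopped.
    bool stop() {
        if (!_fragment_semaphore) {
            // Nothing was created, so there is nothing to stop.
            return false;
        }
        _fragment_semaphore->stop();
        return true;
    }
};
```

Calling `stop()` on a handler that was never initialized is then a no-op rather than a null pointer dereference.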
Although `corrupt_data_handler` was backported to 2025.1, this issue
does not occur in 2025.2 and master. The recent versions include #23113,
which changes how the system keyspace is stopped, and PR #24492, which
originally introduced `corrupt_data_handler`, builds on that change to
ensure `_fragment_semaphore` is stopped only if it has been created.
Fixes #24920
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#24931
optimized_clang.sh trains the compiler using profile-guided optimization
(pgo). However, while doing that, it builds scylladb using its own profile
stored in pgo/profiles and decompressed into build/profile.profdata. Due
to the funky directory structure used for training the compiler, that
path is invalid during the training and the build fails.
The workaround was to build on a cloud machine instead of a workstation -
this worked because the cloud machine didn't have git-lfs installed, and
therefore did not see the stored profile, and the whole mess was averted.
To make this work on a machine that does have access to stored profiles,
disable use of the stored profile even if it exists.
Fixes #22713
Closes scylladb/scylladb#24571
(cherry picked from commit 52f11e140f)
Closes scylladb/scylladb#24620
When writing large partitions, that is: partitions with size or row count
above a configurable threshold, ScyllaDB outputs a warning to the log:
WARN ... large_data - Writing large partition test/test: (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db
This warning contains the size of the partition, but not the number of
rows written. This can be confusing: when the warning is triggered by
the row count exceeding its threshold while the partition size is below
its threshold, the warning still contains only the partition size in
bytes, leading the user to believe that the partition size triggered
the warning. See #20125
This change adds a size_desc argument to cql_table_large_data_handler::try_record(),
which will contain the description of the size of the object written.
This method is used to output warnings for large partitions, row counts,
row sizes and cell sizes. This change does not modify the warning message
for row and cell sizes, only for partition size and row count.
The warning for large partitions and row counts will now look like this:
WARN ... large_data - Writing large partition test/test: (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db
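As an illustration of the new description string, the sketch below builds the text shown in parentheses in the warning above. The helper `make_size_desc` is hypothetical; the source only states that `try_record()` gains a `size_desc` argument, not how the string is assembled.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: formats the size description for a large partition
// or row count warning as "<bytes> bytes/<rows> rows".
inline std::string make_size_desc(std::uint64_t partition_bytes,
                                  std::uint64_t row_count) {
    return std::to_string(partition_bytes) + " bytes/" +
           std::to_string(row_count) + " rows";
}
```

Reporting both values in a single description lets the reader see which threshold was actually crossed, whichever check triggered the warning.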
Closes scylladb/scylladb#22010
(cherry picked from commit 96267960f8)
Closes scylladb/scylladb#24681
It may be particularly beneficial during connection
storms on startup. In such cases, it can happen that
none of the user's read requests succeed, preventing
the cache from being populated. This, in turn, makes
it more difficult for subsequent reads to
succeed, reducing resiliency against such storms.
(cherry picked from commit 887c57098e)
Sometimes we may not want to use the default cassandra role for the
control connection, especially when we test dropping the default role.
(cherry picked from commit 08bf7237f066cead133bf0cac9bba215f238070a)
Waiting for CQL requires the default superuser to be present
in the db. In some cases we may delete it and still want to do
a rolling restart. Additionally, if we need CQL, we may want to
wait after the restart is complete (once, and not for each node).
(cherry picked from commit d9ec746c6d)
In raft mode (auth-v2) we need to do an atomic write after read, as
we give stricter consistency guarantees. Instead of patching the
legacy logic, this commit adds a separate code path because:
- the old code may be less tested now, so it's best not to change it
- the new code path avoids quorum selects in a typical flow (passwords set)
There may be a case where the user deletes a superuser or password
right before restarting a node. In such a case we may omit
updating the password, but:
- this is a trade-off against doing quorum reads on startup
- it's far more important not to update the password when it shouldn't be
- if needed, the password will be updated on the next node restart
If there is no quorum on startup, we'll skip creating the password
because we can't perform any raft operation.
Additionally, this fixes a problem where the password is created despite
a non-default superuser existing in auth-v2.
(cherry picked from commit f85d73d405)
In raft mode (auth-v2) we need to do an atomic write after read, as
we give stricter consistency guarantees. Instead of patching the
legacy logic, this commit adds a separate code path because:
- the old code may be less tested now, so it's best not to change it
- the new code path avoids quorum selects in a typical flow (roles set)
This fixes a problem where the superuser role is created despite
a non-default superuser existing in auth-v2.
If there is no quorum on startup, we'll skip creating the role
because we can't perform any raft operation.
(cherry picked from commit 2e2ba84e94)
ldap_role_manager depends on standard_role_manager,
therefore it needs to wait for superuser initialization.
If this is missing, the password authenticator will start
checking the default password too early and may fail to
create the default password if there is no default
role yet.
Currently the password authenticator will create the password
together with the role in such a case, but in the following
commits we want to separate those responsibilities correctly.
(cherry picked from commit c96c5bfef5)