Commit Graph

47902 Commits

Author SHA1 Message Date
Pavel Emelyanov
71e9f5e662 sstables_loader: Fix load-and-stream vs skip-cleanup check
The intention was to fail the REST API call in case --skip-cleanup is
requested for --load-and-stream loading. The corresponding if expression
is checking something else :( despite log message is correct.

Fixes: https://github.com/scylladb/scylladb/issues/24913

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
(cherry picked from commit bd3bd089e1)

Closes scylladb/scylladb#24947
2025-07-15 13:08:32 +03:00
Yaron Kaikov
f1d4266b7a dist/common/scripts/scylla_sysconfig_setup: fix SyntaxWarning: invalid escape sequence
There are invalid escape sequence warnings where raw strings should be used for the regex patterns

Fixes: https://github.com/scylladb/scylladb/issues/24915

Closes scylladb/scylladb#24916

(cherry picked from commit fdcaa9a7e7)

Closes scylladb/scylladb#24968
2025-07-15 11:06:41 +02:00
Yaron Kaikov
9c0181e813 auto-backport.py: Avoid bot push to existing backport branches
Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork.
If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open.

Fixes: https://github.com/scylladb/scylladb/issues/24953

Closes scylladb/scylladb#24954

(cherry picked from commit ed7c7784e4)

Closes scylladb/scylladb#24967
2025-07-15 10:27:28 +02:00
Aleksandra Martyniuk
6f8b378e80 nodetool: repair: skip tablet keyspaces
Currently, nodetool repair command repairs both vnode and tablet keyspaces
if no keyspace is specified. We should use this command to repair
only vnode keyspaces, but this isn't easily accessible - we have to
explicitly run repair only on vnode keyspaces.

nodetool repair skips tablet keyspaces unless a tablet keyspace
is explicitely passed as an argument.

Fixes: #24040.

Closes scylladb/scylladb#24042
2025-07-15 06:36:08 +03:00
Jenkins Promoter
afadcc648d Update pgo profiles - aarch64 2025-07-15 05:39:11 +03:00
Jenkins Promoter
2397a93410 Update pgo profiles - x86_64 2025-07-15 05:23:22 +03:00
Yaron Kaikov
6e06c57fc7 packaging: add ps command to dependancies
ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator.

Fixes: #24827

Closes scylladb/scylladb#24830

(cherry picked from commit 66ff6ab6f9)

Closes scylladb/scylladb#24955
2025-07-14 14:26:38 +03:00
Gleb Natapov
ece8a8b3bc api: unregister raft_topology_get_cmd_status on shutdown
In c8ce9d1c60 we introduced
raft_topology_get_cmd_status REST api but the commit forgot to
unregister the handler during shutdown.

Fixes #24910

Closes scylladb/scylladb#24911

(cherry picked from commit 89f2edf308)

Closes scylladb/scylladb#24922
2025-07-14 11:39:42 +02:00
Avi Kivity
5bed6c7a7f storage_proxy: avoid large allocation when storing batch in system.batchlog
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.

Fix by going through the more primitive, but less contiguous,
atomic_cell API.

Fixes #24809.

Closes scylladb/scylladb#24811

(cherry picked from commit 60f407bff4)

Closes scylladb/scylladb#24845
2025-07-13 14:11:01 +03:00
Patryk Jędrzejczak
605106a9c6 Merge '[Backport 2025.2] Make it easier to debug stuck raft topology operation.' from Scylladb[bot]
The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations.

Backport since we want to have in the production as quick as possible.

Fixes #24860

- (cherry picked from commit c8ce9d1c60)

- (cherry picked from commit 4e6369f35b)

Parent PR: #24799

Closes scylladb/scylladb#24879

* https://github.com/scylladb/scylladb:
  topology coordinator: log a start and an end of topology coordinator command execution at info level
  topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
2025-07-09 12:58:14 +02:00
Piotr Dulikowski
e4dde34f52 Merge '[Backport 2025.2] main: don't start maintenance auth service if not enabled' from Scylladb[bot]
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all not eol versions since 6.0 and 2025.1

- (cherry picked from commit 97c60b8153)

- (cherry picked from commit dd01852341)

Parent PR: #24527

Closes scylladb/scylladb#24570

* github.com:scylladb/scylladb:
  test: add test for live updates of permissions cache config
  main: don't start maintenance auth service if not enabled
2025-07-09 09:47:57 +02:00
Piotr Dulikowski
8ebd67e1c3 Merge '[Backport 2025.2] batchlog_manager: abort replay of a failed batch on shutdown or node down' from Scylladb[bot]
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.

Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.

backport to relevant versions since the issue can cause node shutdown to hang

Fixes scylladb/scylladb#24599

- (cherry picked from commit 8d48b27062)

- (cherry picked from commit fc5ba4a1ea)

- (cherry picked from commit 7150632cf2)

- (cherry picked from commit 74a3fa9671)

- (cherry picked from commit a9b476e057)

- (cherry picked from commit d7af26a437)

Parent PR: #24595

Closes scylladb/scylladb#24880

* github.com:scylladb/scylladb:
  test: test_batchlog_manager: batchlog replay includes cdc
  test: test_batchlog_manager: test batch replay when a node is down
  batchlog_manager: set timeout on writes
  batchlog_manager: abort writes on shutdown
  batchlog_manager: create cancellable write response handler
  storage_proxy: add write type parameter to mutate_internal
2025-07-08 15:39:58 +02:00
Michael Litvak
a26d8f72b6 test: test_batchlog_manager: batchlog replay includes cdc
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.

This is done in order to verify that it works currently as expected and
doesn't break in the future.

(cherry picked from commit d7af26a437)
2025-07-08 06:25:03 +00:00
Michael Litvak
9012357a4b test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.

(cherry picked from commit a9b476e057)
2025-07-08 06:25:03 +00:00
Michael Litvak
8fa2520e15 batchlog_manager: set timeout on writes
Set a timeout on writes of replayed batches by the batchlog manager.

We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.

The timeout is set to be high enough to allow any reasonable write to
complete.

(cherry picked from commit 74a3fa9671)
2025-07-08 06:25:03 +00:00
Michael Litvak
ba11e8ebdd batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.

(cherry picked from commit 7150632cf2)
2025-07-08 06:25:02 +00:00
Michael Litvak
4e2b587b4d batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.

(cherry picked from commit fc5ba4a1ea)
2025-07-08 06:25:02 +00:00
Michael Litvak
f8b4d1c1cd storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that allows to indicate the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.

(cherry picked from commit 8d48b27062)
2025-07-08 06:25:02 +00:00
Gleb Natapov
71f59e046b topology coordinator: log a start and an end of topology coordinator command execution at info level
Those calls a relatively rare and the output may help to analyze issues
in production.

(cherry picked from commit 4e6369f35b)
2025-07-08 06:23:48 +00:00
Gleb Natapov
ad91198417 topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
The topology coordinator executes several topology cmd rpc against some nodes
during a topology change. A topology operation will not proceed unless
rpc completes (successfully or not), but sometimes it appears that it
hangs and it is hard to tell on which nodes it did not complete yet.
Introduce new REST endpoint that can help with debugging such cases.
If executed on the topology coordinator it returns currently running
topology rpc (if any) and a list of nodes that did not reply yet.

(cherry picked from commit c8ce9d1c60)
2025-07-08 06:23:48 +00:00
Marcin Maliszkiewicz
8e783cc23a test: add test for live updates of permissions cache config
(cherry picked from commit dd01852341)
2025-07-07 10:07:02 +02:00
Marcin Maliszkiewicz
b1e75eba65 main: don't start maintenance auth service if not enabled
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

(cherry picked from commit 97c60b8153)
2025-07-07 10:06:28 +02:00
Patryk Jędrzejczak
370165cca5 docs: handling-node-failures: fix typo
Replacing "from" is incorrect. The typo comes from recently
merged #24583.

Fixes #24732

Requires backport to 2025.2 since #24583 has been backported to 2025.2.

Closes scylladb/scylladb#24733

(cherry picked from commit fa982f5579)

Closes scylladb/scylladb#24831
2025-07-04 19:32:09 +02:00
Gleb Natapov
24a317460d gossiper: do not assume that id->ip mapping is available in failure_detector_loop_for_node
failure_detector_loop_for_node may be started on a shard before id->ip
mapping is available there. Currently the code treats missing mapping
as an internal error, but it uses its result for debug output only, so
lets relax the code to not assume the mapping is available.

Fixes #23407

Closes scylladb/scylladb#24614

(cherry picked from commit a221b2bfde)

Closes scylladb/scylladb#24768
2025-07-04 16:09:50 +02:00
Michał Chojnowski
08b117425e utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752

(cherry picked from commit a29724479a)

Closes scylladb/scylladb#24777
2025-07-03 11:02:30 +03:00
Łukasz Paszkowski
cf36de2c9a compaction_manager: cancel submission timer on drain
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.

Thus, cancel the timer, when calling the drain method to disable
the compaction manager.

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected. So it's a good candidate for a backport.

Closes scylladb/scylladb#24505

(cherry picked from commit a9a53d9178)

Closes scylladb/scylladb#24590
2025-07-03 10:19:16 +03:00
Tomasz Grabiec
440985387e Merge '[Backport 2025.2] repair: postpone repair until topology is not busy ' from Scylladb[bot]
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.

Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there is no tablet transitions) and holds the erm;
- resize finalization finishes;

then the repair sees a topology state different than the actual -
it does not see that the storage groups were already split.
Repair code does not handle this case and it results with
on_internal_error.

Start repair when topology is not busy. The check isn't atomic,
as it's done on a shard 0. Thus, we compare the topology versions
to ensure that the business check is valid.

Fixes: https://github.com/scylladb/scylladb/issues/24195.

Needs backport to all branches since they are affected

- (cherry picked from commit df152d9824)

- (cherry picked from commit 83c9af9670)

Parent PR: #24202

Closes scylladb/scylladb#24781

* github.com:scylladb/scylladb:
  test: add test for repair and resize finalization
  repair: postpone repair until topology is not busy
2025-07-02 13:18:22 +02:00
Botond Dénes
e8c9a412bc docs: cql/types.rst: remove reference to frozen-only UDTs
ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing
this limitation in the current docs. Replace the description of the
limitation with general description of frozen semantics for UDTs.

Fixes: #22929

Closes scylladb/scylladb#24763

(cherry picked from commit 37ef9efb4e)

Closes scylladb/scylladb#24782
2025-07-02 12:10:35 +03:00
Ferenc Szili
f24d71ab8c logging: Add row count to large partition warning message
When writing large partitions, that is: partitions with size or row count
above a configurable threshold, ScyllaDB outputs a warning to the log:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

This warning contains the information about the size of the partition,
but it does not contain the number of rows written. This can lead to
confusion because in cases where the warning was written because of the
row count being larger than the threshold, but the partition size is below
the threshold, the warning will only contain the partition size in bytes,
leading the user to believe the warning was output because of the
partition size, when in reality it was the row count that triggered the
warning. See #20125

This change adds a size_desc argument to cql_table_large_data_handler::try_record(),
which will contain the description of the size of the object written.
This method is used to output warnings for large partitions, row counts,
row sizes and cell sizes. This change does not modify the warning message
for row and cell sizes, only for partition size and row count.

The warning for large partitions and row counts will now look like this:

WARN ... large_data - Writing large partition test/test:  (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db

Closes scylladb/scylladb#22010

(cherry picked from commit 96267960f8)

Closes scylladb/scylladb#24685
2025-07-02 11:22:40 +03:00
Botond Dénes
3560d9ad82 Merge '[Backport 2025.2] sstables: purge SCYLLA_ASSERT from the sstable read/parse paths' from Scylladb[bot]
Introduce `sstables::parse_assert()`, to replace `SCYLLA_ASSERT()` on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database, the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced `parse_assert()` uses `on_internal_error()` under the hood, which prints a backtrace and optionally allows for aborting when on the error, to generate a coredump.

Fixes https://github.com/scylladb/scylladb/issues/20845

We just hit another case of `SCYLLA_ASSERT()` triggering due to corrupt sstables bringing down nodes in the field, should be backported to all releases, so we don't hit this in the future

- (cherry picked from commit 27e26ed93f)

- (cherry picked from commit bce89c0f5e)

Parent PR: #24534

Closes scylladb/scylladb#24686

* github.com:scylladb/scylladb:
  sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
  sstables/exceptions: introduce parse_assert()
2025-07-02 11:20:48 +03:00
Botond Dénes
9dd8dc5357 Merge '[Backport 2025.2] Fix for cassandra role gets recreated after DROP ROLE' from Scylladb[bot]
This patchset fixes regression introduced by 7e749cd848 when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user.

Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm
it with CL QUORUM and only then atomically create role or password.

If server is started without cluster quorum we'll skip creating role or password.

Fixes https://github.com/scylladb/scylladb/issues/24469
Backport: all versions since 2024.2

- (cherry picked from commit 68fc4c6d61)

- (cherry picked from commit c96c5bfef5)

- (cherry picked from commit 2e2ba84e94)

- (cherry picked from commit f85d73d405)

- (cherry picked from commit d9ec746c6d)

- (cherry picked from commit a3bb679f49)

- (cherry picked from commit 67a4bfc152)

- (cherry picked from commit 0ffddce636)

- (cherry picked from commit 5e7ac34822)

Parent PR: #24451

Closes scylladb/scylladb#24694

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for password reset procedure
  auth: cache roles table scan during startup
  test: auth_cluster: add test for replacing default superuser
  test: pylib: add ability to specify default authenticator during server_start
  auth: split auth-v2 logic for adding default superuser password
  auth: split auth-v2 logic for adding default superuser role
  auth: ldap: fix waiting for underlying role manager
  auth: wait for default role creation before starting authorizer and authenticator
2025-07-02 10:40:56 +03:00
Botond Dénes
d4d464e21e tools/scylla-nodetool: backup: add --move-files parameter
Allow opting in for backup to move the files instead of copying them.

Fixes: https://github.com/scylladb/scylladb/issues/24372

Closes scylladb/scylladb#24503

(cherry picked from commit e715a150b9)

Closes scylladb/scylladb#24721
2025-07-02 10:40:04 +03:00
Botond Dénes
9816cdb901 Merge '[Backport 2025.2] mutation: check key of inserted rows' from Scylladb[bot]
Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.

Fixes: https://github.com/scylladb/scylladb/issues/24506

Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.

- (cherry picked from commit 8b756ea837)

- (cherry picked from commit ab96c703ff)

Parent PR: #24497

Closes scylladb/scylladb#24742

* github.com:scylladb/scylladb:
  mutation: check key of inserted rows
  compound: optimize is_full() for single-component types
2025-07-02 10:27:13 +03:00
Avi Kivity
3d3976c862 repair: row_level: unstall to_repair_rows_on_wire() destroying its input
to_repair_rows_on_wire() moves the contents of its input std::list
and is careful to yield after each element, but the final destruction
of the input list still deals with all of the list elements without
yielding. This is expensive as not all contents of repair_row are moved
(_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>).

To fix, destroy each row element as we move along. This is safe as we
own the input and don't reference row_list other than for the iteration.

Fixes #24725.

Closes scylladb/scylladb#24726

(cherry picked from commit 6aa71205d8)

Closes scylladb/scylladb#24769
2025-07-02 10:06:30 +03:00
Botond Dénes
7786167998 Merge '[Backport 2025.2] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot]
Fixes #24574

* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.

- (cherry picked from commit ee98f5d361)

- (cherry picked from commit 8d37e5e24b)

Parent PR: #24633

Closes scylladb/scylladb#24770

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Add exception handler to ensure proxy stop
  encryption: Ensure stopping timers in provider cache objects
2025-07-02 10:03:52 +03:00
Aleksandra Martyniuk
cbce0ed911 test: add test for repair and resize finalization
Add test that checks whether repair does not start if there is an
ongoing resize finalization.

(cherry picked from commit 83c9af9670)
2025-07-01 20:26:21 +00:00
Aleksandra Martyniuk
eb96ef8ce7 repair: postpone repair until topology is not busy
Currently, repair_service::repair_tablets starts repair if there
is no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization. This may cause
a data race and unexpected behavior.

Start repair when topology is not busy.

(cherry picked from commit df152d9824)
2025-07-01 20:26:21 +00:00
Patryk Jędrzejczak
e3952fbd35 test: test_raft_recovery_user_data: disable hinted handoff
The test is currently flaky, writes can fail with "Too many in flight
hints: 10485936". See scylladb/scylladb#23565 for more details.

We suspect that scylladb/scylladb#23565 is caused by an infrastructure
issue - slow disks on some machines we run CI jobs on.

Since the test fails often and investigation doesn't seem to be easy,
we first deflake the test in this patch by disabling hinted handoff.

For replacing nodes, we provide `cfg` because there should have been
`cfg` in the first place. The test was correct anyway because:
- `tablets_mode_for_new_keyspaces` is set to `true` by default in
  test/cluster/suite.yaml,
- `endpoint_snitch` is set to `GossipingPropertyFileSnitch` by default
  if the property file is provided in `ScyllaServer.__init__`.

Ref scylladb/scylladb#23565

We should backport this patch to 2025.2 because this test is also flaky
on CI jobs using 2025.2. Older branches don't have this test.

Closes scylladb/scylladb#24364

(cherry picked from commit 8756c233e0)

Fixes #24756

Closes scylladb/scylladb#24757
2025-07-01 19:14:22 +02:00
Calle Wilund
9a60a2adce encryption_at_rest_test: Add exception handler to ensure proxy stop
If boost test is run such that we somehow except even in a test macro
such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use after free.

(cherry picked from commit 8d37e5e24b)
2025-07-01 15:12:50 +00:00
Calle Wilund
3380479455 encryption: Ensure stopping timers in provider cache objects
utils::loading cache has a timer that can, if we're unlucky, be runnnig
while the encryption context/extensions referencing the various host
objects containing them are destroyed in the case of unit testing.

Add a stop phase in encryption context shutdown closing the caches.

(cherry picked from commit ee98f5d361)
2025-07-01 15:12:50 +00:00
Anna Stuchlik
0d7d983133 doc: extend 2025.2 upgrade with a note about consistent topology updates
This commit adds a note that the user should enable consistent topology updates before upgrading
to 2025.2 if they didn't do it (for some reason) when previously upgrading to version 2025.1.

Fixes https://github.com/scylladb/scylladb/issues/24467

Closes scylladb/scylladb#24468

(cherry picked from commit e2b7302183)

Closes scylladb/scylladb#24523
2025-07-01 12:33:31 +03:00
Avi Kivity
dd509b9513 Merge '[Backport 2025.2] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot]
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

- (cherry picked from commit 7d551f99be)

- (cherry picked from commit 975e7e405a)

Parent PR: #21638

Closes scylladb/scylladb#24604

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-07-01 12:31:25 +03:00
Avi Kivity
9ccc96bf05 tools: optimized_clang: make it work in the presence of a scylladb profile
optimized_clang.sh trains the compiler using profile-guided optimization
(pgo). However, while doing that, it builds scylladb using its own profile
stored in pgo/profiles and decompressed into build/profile.profdata. Due
to the funky directory structure used for training the compiler, that
path is invalid during the training and the build fails.

The workaround was to build on a cloud machine instead of a workstation -
this worked because the cloud machine didn't have git-lfs installed, and
therefore did not see the stored profile, and the whole mess was averted.

To make this work on a machine that does have access to stored profiles,
disable use of the stored profile even if it exists.

Fixes #22713

Closes scylladb/scylladb#24571

(cherry picked from commit 52f11e140f)

Closes scylladb/scylladb#24621
2025-07-01 12:30:58 +03:00
Michał Chojnowski
50736e9740 test_sstable_compression_dictionaries_basic.py: fix a flaky check
test_dict_memory_limit trains new dictionaries and checks (via metrics)
that the old dictionaries are appropriately cleaned up.
The problem is that the cleanup is asynchronous (because the lifetimes
are handled by foreign_ptr, which sends the destructor call
to the owner shard asynchronously), so the metrics might be
checked a few milliseconds before the old dictionary is cleaned up.

The dict lifetimes are lazy on purpose, the right thing to do is
to just let the test retry the check.

Fixes scylladb/scylladb#24516

Closes scylladb/scylladb#24526

(cherry picked from commit cace55aaaf)

Closes scylladb/scylladb#24653
2025-07-01 12:30:25 +03:00
Avi Kivity
ee733c4d38 Merge '[Backport 2025.2] generic_server: fix connections semaphore config observer' from Scylladb[bot]
In ed3e4f33fd we introduced new connection throttling feature which is controlled by uninitialized_connections_semaphore_cpu_concurrency config. But live updating of it was broken, this patch fixes it.

When the temporary value from observer() is destroyed, it disconnects from updateable_value, so observation stops right away. We need to retain the observer.

Backport: to 2025.2 where this feature was added
Fixes: https://github.com/scylladb/scylladb/issues/24557

- (cherry picked from commit c6a25b9140)

- (cherry picked from commit 45392ac29e)

- (cherry picked from commit 68ead01397)

Parent PR: #24484

Closes scylladb/scylladb#24679

* github.com:scylladb/scylladb:
  test: add test for live updates of generic server config
  utils: don't allow do discard updateable_value observer
  generic_server: fix connections semaphore config observer
2025-07-01 12:29:53 +03:00
Szymon Malewski
3bac46a18d utils/exceptions.cc: Added check for exceptions::request_timeout_exception in is_timeout_exception function.
It solves the issue, where in some cases a timeout exceptions in CAS operations are logged incorrectly as a general failure.

Fixes #24591

Closes scylladb/scylladb#24619

(cherry picked from commit f28bab741d)

Closes scylladb/scylladb#24687
2025-07-01 12:29:27 +03:00
Lakshmi Narayanan Sreethar
adab525151 utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640

(cherry picked from commit 279253ffd0)

Closes scylladb/scylladb#24692
2025-07-01 12:28:55 +03:00
Botond Dénes
5f45cf1683 test/boost/memtable_test: only inject error for test table
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.

Fixes: https://github.com/scylladb/scylladb/issues/24637

Closes scylladb/scylladb#24638

(cherry picked from commit ee6d7c6ad9)

Closes scylladb/scylladb#24743
2025-07-01 12:28:05 +03:00
Avi Kivity
5e4941a74b Merge '[Backport 2025.2] sstables/mx/writer: handle non-full prefix row keys' from Scylladb[bot]
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.

Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.

Add a full-stack test which checks that rows with bad keys are correctly handled.

Fixes: https://github.com/scylladb/scylladb/issues/24489

The bug is present in all versions, has to be backported to all supported versions.

- (cherry picked from commit 92b5fe8983)

- (cherry picked from commit 0753643606)

- (cherry picked from commit b0d5462440)

- (cherry picked from commit 093d4f8d69)

- (cherry picked from commit 678deece88)

- (cherry picked from commit 64f8500367)

- (cherry picked from commit b931145a26)

- (cherry picked from commit 3e1c50e9a7)

- (cherry picked from commit 46ff7f9c12)

- (cherry picked from commit ebd9420687)

- (cherry picked from commit aae212a87c)

- (cherry picked from commit 592ca789e2)

- (cherry picked from commit edc2906892)

Parent PR: #24492

Closes scylladb/scylladb#24744

* github.com:scylladb/scylladb:
  test/boost/sstable_datafile_test: add test for corrupt data
  sstables/mx/writer: handler rows with empty keys
  test/lib/cql_assertions: introduce columns_assertions
  sstables: add corrupt_data_handler to sstables::sstables
  tools/scylla-sstable: make large_data_handler a local
  db: introduce corrupt_data_handler
  mutation: introduce frozen_mutation_fragment_v2
  mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
  mutation/mutation_partition_view: extract de-ser of {clustering,static} row
  idl-compiler.py: generate skip() definition for enums serializers
  idl: extract full_position.idl from position_in_partition.idl
  db/system_keyspace: add apply_mutation()
  db/system_keyspace: introduce the corrupt_data table
2025-07-01 12:27:01 +03:00
Gleb Natapov
31ed717afb storage_proxy: retry paxos repair even if repair write succeeded
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless if write back succeeded or not. This
is how the code worked originally but it was unintentionally changed
when co-routinized in 61b2e41a23.

Fixes #24630

Closes scylladb/scylladb#24651

(cherry picked from commit 5f953eb092)

Closes scylladb/scylladb#24703
2025-07-01 10:15:12 +02:00