Commit Graph

50466 Commits

Author SHA1 Message Date
Raphael S. Carvalho
9b6ff15a72 replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ddb27488fa)
2026-05-06 16:14:08 +03:00
Raphael S. Carvalho
4bdf70440f replica: Wire add_new_sstable_and_update_cache() into streaming consumer
After the wiring, failure to attach the new sstable in the streaming
consumer will unlink the sstable automatically.

Fixes #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 10225ee434)
2026-05-06 16:14:08 +03:00
Raphael S. Carvalho
0de4fdd060 replica: Document old add_sstable_and_update_cache() variants
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit a72025bbf6)
2026-05-06 16:14:08 +03:00
Raphael S. Carvalho
2f42645569 replica: Introduce add_new_sstables_and_update_cache()
Piggyback on new add_new_sstable_and_update_cache(), replacing
the previous add_sstables_and_update_cache().

Will be used by intra-node migration since we want it to be
safe when loading the cloned sstables. An unsplit sstable
can be cloned into destination which already ACKed split,
so we need this variant which splits sstable if needed,
while it's unsealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3f8363300a)
2026-05-06 16:14:08 +03:00
Raphael S. Carvalho
088f04f4c0 replica: Introduce add_new_sstable_and_update_cache()
Failure to load sstable in streaming can leave sealed sstables
on disk since they're not unlinked on failure.

This can result in several problems:
1) Data resurrection: since the sstable may contain deleted data
2) Split issue: since the finalization requires all sstables to be split
3) Disk usage issue: since the sstables hold space and streaming retries
can keep accumulating these files.

This new procedure will be later wired into streaming consumers, in
order to fix those problems.

Another benefit of the interface is that if there's split when adding
the new sstable, the output sstables will be returned to the caller,
allowing them to register the actual loaded sstables into e.g.
the view builder.

Refs #27414.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 63d1d6c39b)
2026-05-06 16:14:08 +03:00
Raphael S. Carvalho
702d524145 replica: Account for sstables being added before ACKing split
We want the invariant that after ACK, all sealed sstables will be split.
If check-and-attach is not atomic, this sequence is possible:

1) no split decision set.
2) Unsplit sstable is checked, no need to split, sealed.
3) split decision is set and ACKed
4) unsplit sstable is attached

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 27d460758f)
2026-05-06 16:14:07 +03:00
Raphael S. Carvalho
df200d4f84 replica: Remove repair read lock from maybe_split_new_sstable()
The lock is intended to serialize some maintenance compactions,
such as major, with repair. But maybe_split_new_sstable() is
restricted solely to new sstables that aren't part of the
sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 794e03856a)
2026-05-06 16:14:07 +03:00
Raphael S. Carvalho
bdecb0ccde compaction: Preserve state of input sstable in maybe_split_new_sstable()
This is crucial with MVs, since the splitting must preserve the state of
the original sstable. We want the sstable to be in staging dir, so it's
excluded when calculating the diff for performing pushes to view
replicas.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 2dae0a7380)
2026-05-06 16:14:07 +03:00
Raphael S. Carvalho
3b0dbe4d3b Rename maybe_split_sstable() to maybe_split_new_sstable()
Since the function must only be used on new sstables, it should
be renamed to something describing its usage should be restricted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1fdc410e24)
2026-05-06 16:14:07 +03:00
Raphael S. Carvalho
6de7d95715 sstables: Allow storage::snapshot() to leave destination sstable unsealed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 1a077a80f1)
2026-05-06 16:14:07 +03:00
Raphael S. Carvalho
c2b974d87b sstables: Add option to leave sstable unsealed in the stream sink
That will be needed for file streaming to leave output sstable unsealed.

we want the invariant where all sealed sstables are split after split
was ACKed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c5e840e460)
2026-05-06 15:47:56 +03:00
Raphael S. Carvalho
69e468f599 test: Verify unsealed sstable can be compacted
This is crucial for splitting before sealing the sstable produced by
repair. This way, unsplit sstables won't be left on disk sealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c10486a5e9)
2026-05-06 15:47:56 +03:00
Raphael S. Carvalho
88657f08b9 sstables: Allow unsealed sstable to be loaded
File streaming will have to load an unsealed sstable, so we need
to be able to parse components from temporary TOC instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ab82428228)
2026-05-06 15:47:55 +03:00
Raphael S. Carvalho
b31191d92e sstables: Restore sstable_writer_config::leave_unsealed
This option was retired in commit 0959739216, but
it will be again needed in order to implement split before sealing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b1be4ba2fc)
2026-05-06 15:47:55 +03:00
Botond Dénes
8e1ecd4feb Merge 'test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces' from Dawid Mędrek
The test was flaky. The scenario looked like this:

1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.

It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.

Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.

The new scenario looks like this:

1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
   observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.

This will make the test perform consistently.

---

I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.

Before:
```
real    0m47.570s
user    0m41.631s
sys     0m8.634s

real    0m50.495s
user    0m42.499s
sys     0m8.607s

real    0m50.375s
user    0m41.832s
sys     0m8.789s
```

After:
```
real    0m50.509s
user    0m43.535s
sys     0m9.715s

real    0m50.857s
user    0m44.185s
sys     0m9.811s

real    0m50.873s
user    0m44.289s
sys     0m9.737s
```

Fixes SCYLLADB-1137

Backport: The test is present on all supported branches, and so we
          should backport these changes to them.

Closes scylladb/scylladb#29218

* github.com:scylladb/scylladb:
  test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
  test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py

(cherry picked from commit d52fbf7ada)

Closes scylladb/scylladb#29670
2026-05-06 10:10:47 +02:00
Tomasz Grabiec
2720a197d0 test: pylib: Ignore exceptions in wait_for()
ManagerClient::get_ready_cql() calls server_sees_others(), which waits
for servers to see each other as alive in gossip. If one of the
servers is still early in boot, RESTful API call to
"gossiper/endpoint/live" may fail. It throws an exception, which
currently terminates the wait_for() and propagates up, failing the test.

Fix this by ignoring errors when polling inside wait_for. In case of
timeout, we log the last exception. This should fix the problem not
only in this case, for all uses of wait_for().

Example output:

```
pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140>
deadline = 1775218828.9172852, period = 1.0, before_retry = None
backoff_factor = 1.5, max_period = 1.0, label = None

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float,
            period: float = 0.1,
            before_retry: Optional[Callable[[], Any]] = None,
            backoff_factor: float = 1.5,
            max_period: float = 1.0,
            label: Optional[str] = None) -> T:
        tag = label or getattr(pred, '__name__', 'unlabeled')
        start = time.time()
        retries = 0
        last_exception: Exception | None = None
        while True:
            elapsed = time.time() - start
            if time.time() >= deadline:
                timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)"
                if last_exception is not None:
                    timeout_msg += (
                        f"; last exception: {type(last_exception).__name__}: {last_exception}"
                    )
                    raise AssertionError(timeout_msg) from last_exception
                raise AssertionError(timeout_msg)

            try:
>               res = await pred()

test/pylib/util.py:80:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    async def _sees_min_others():
>       raise Exception("asd")
E       Exception: asd

test/pylib/manager_client.py:802: Exception

The above exception was the direct cause of the following exception:

manager = <test.pylib.manager_client.ManagerClient object at 0x7f022af7e7b0>

    @pytest.mark.asyncio
    async def test_auth_after_reset(manager: ManagerClient) -> None:
        servers = await manager.servers_add(3, config=auth_config, auto_rack_dc="dc1")
>       cql, _ = await manager.get_ready_cql(servers)

test/cluster/auth_cluster/test_auth_after_reset.py:33:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/manager_client.py:137: in get_ready_cql
    await self.servers_see_each_other(servers)
test/pylib/manager_client.py:820: in servers_see_each_other
    await asyncio.gather(*others)
test/pylib/manager_client.py:806: in server_sees_others
    await wait_for(_sees_min_others, time() + interval, period=.5)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

pred = <function ManagerClient.server_sees_others.<locals>._sees_min_others at 0x7f022af9a140>
deadline = 1775218828.9172852, period = 1.0, before_retry = None
backoff_factor = 1.5, max_period = 1.0, label = None

    async def wait_for(
            pred: Callable[[], Awaitable[Optional[T]]],
            deadline: float,
            period: float = 0.1,
            before_retry: Optional[Callable[[], Any]] = None,
            backoff_factor: float = 1.5,
            max_period: float = 1.0,
            label: Optional[str] = None) -> T:
        tag = label or getattr(pred, '__name__', 'unlabeled')
        start = time.time()
        retries = 0
        last_exception: Exception | None = None
        while True:
            elapsed = time.time() - start
            if time.time() >= deadline:
                timeout_msg = f"wait_for({tag}) timed out after {elapsed:.2f}s ({retries} retries)"
                if last_exception is not None:
                    timeout_msg += (
                        f"; last exception: {type(last_exception).__name__}: {last_exception}"
                    )
>                   raise AssertionError(timeout_msg) from last_exception
E                   AssertionError: wait_for(_sees_min_others) timed out after 45.30s (46 retries); last exception: Exception: asd

test/pylib/util.py:76: AssertionError
```

Fixes a failure observed in test_auth_after_reset:

```
manager = <test.pylib.manager_client.ManagerClient object at 0x7fb3740e1630>

    @pytest.mark.asyncio
    async def test_auth_after_reset(manager: ManagerClient) -> None:
        servers = await manager.servers_add(3, config=auth_config, auto_rack_dc="dc1")
        cql, _ = await manager.get_ready_cql(servers)
        await cql.run_async("ALTER ROLE cassandra WITH PASSWORD = 'forgotten_pwd'")

        logging.info("Stopping cluster")
        await asyncio.gather(*[manager.server_stop_gracefully(server.server_id) for server in servers])

        logging.info("Deleting sstables")
        for table in ["roles", "role_members", "role_attributes", "role_permissions"]:
            await asyncio.gather(*[manager.server_wipe_sstables(server.server_id, "system", table) for server in servers])

        logging.info("Starting cluster")
        # Don't try connect to the servers yet, with deleted superuser it will be possible only after
        # quorum is reached.
        await asyncio.gather(*[manager.server_start(server.server_id, connect_driver=False) for server in servers])

        logging.info("Waiting for CQL connection")
        await repeat_until_success(lambda: manager.driver_connect(auth_provider=PlainTextAuthProvider(username="cassandra", password="cassandra")))
>       await manager.get_ready_cql(servers)

test/cluster/auth_cluster/test_auth_after_reset.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/manager_client.py:137: in get_ready_cql
    await self.servers_see_each_other(servers)
test/pylib/manager_client.py:819: in servers_see_each_other
    await asyncio.gather(*others)
test/pylib/manager_client.py:805: in server_sees_others
    await wait_for(_sees_min_others, time() + interval, period=.5)
test/pylib/util.py:71: in wait_for
    res = await pred()
test/pylib/manager_client.py:802: in _sees_min_others
    alive_nodes = await self.api.get_alive_endpoints(server_ip)
test/pylib/rest_client.py:243: in get_alive_endpoints
    data = await self.client.get_json(f"/gossiper/endpoint/live", host=node_ip)
test/pylib/rest_client.py:99: in get_json
    ret = await self._fetch("GET", resource_uri, response_type = "json", host = host,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <test.pylib.rest_client.TCPRESTClient object at 0x7fb2404a0650>
method = 'GET', resource = '/gossiper/endpoint/live', response_type = 'json'
host = '127.15.252.8', port = 10000, params = None, json = None, timeout = None
allow_failed = False

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None, allow_failed: bool = False) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")

        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if allow_failed:
                return await resp.json()
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 404, uri: http://127.15.252.8:10000/gossiper/endpoint/live, params: None, json: None, body:
E               {"message": "Not found", "code": 404}

test/pylib/rest_client.py:77: HTTPError
```

Fixes: SCYLLADB-1381

Closes scylladb/scylladb#29323

(cherry picked from commit 74542be5aa)

Closes scylladb/scylladb#29338

Closes scylladb/scylladb#29713
2026-05-03 11:36:20 +03:00
Botond Dénes
b616af7642 sstables: fix segfault in parse_assert() when message is nullptr
parse_assert() accepts an optional `message` parameter that defaults
to nullptr. When the assertion fails and message is nullptr, it is
implicitly converted to sstring via the sstring(const char*) constructor,
which calls strlen(nullptr) -- undefined behavior that manifests as a
segfault in __strlen_evex.

This turns what should be a graceful malformed_sstable_exception into a
fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered
parse_assert() during streaming (in continuous_data_consumer::
fast_forward_to()), causing a crash loop on the affected node.

Fix by guarding the nullptr case with a ternary, passing an empty
sstring() when message is null. on_parse_error() already handles
the empty-message case by substituting "parse_assert() failed".

Fixes: SCYLLADB-1672

Closes scylladb/scylladb#29285

(cherry picked from commit cfebe17592)

Closes scylladb/scylladb#29597

Closes scylladb/scylladb#29720
2026-05-02 16:28:38 +03:00
Wojciech Mitros
50f1982881 view: apply existing range tombstones after exhausting the update reader
When view_update_builder::on_results() hits the path where the update
fragment reader is already exhausted, it still needs to keep tracking
existing range tombstones and apply them to encountered rows.
Otherwise a row covered by an existing range tombstone can appear
alive while generating the view update and create a spurious view row.

Update the existing tombstone state even on the exhausted-reader path
and apply the effective tombstone to clustering rows before generating
the row tombstone update. Add a cqlpy regression test covering the
partition-delete-after-range-tombstone case.

Fixes: SCYLLADB-1649

Closes scylladb/scylladb#29481

(cherry picked from commit 073710a661)

Closes scylladb/scylladb#29649

Closes scylladb/scylladb#29708
2026-05-01 06:23:29 +02:00
Jenkins Promoter
7e801ee72f Update pgo profiles - aarch64 2026-05-01 04:48:33 +03:00
Jenkins Promoter
4c80e0b200 Update pgo profiles - x86_64 2026-05-01 03:50:13 +03:00
Botond Dénes
10a502ea12 test/cluster/test_data_resurrection_in_memtable.py: use explicit CL
The test has expectation w.r.t which write makes it to which nodes:
* inserts make it to all nodes
* delete makes it to all-1 (QUORUM) node

However, this was not expressed with CL, and the default CL=ONE allowed
for some nodes missing the writes and this violating the tests
expectations on what data is persent on which nodes. This resulted on
the test being flaky and failing on the data checks.

Use explicit CL for the ingestion to prevent this.

The improvements to the test introduced in
a8dd13731f was of great help in
investigating this: traces are now available and the check happens after
the data was dumped to logs.

Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102

Closes scylladb/scylladb#29128

(cherry picked from commit 89388510a0)

Closes scylladb/scylladb#29141

Closes scylladb/scylladb#29535
2026-04-30 21:46:28 +03:00
Calle Wilund
ca55b11fd3 memtable_test::memtable_flush_period: Change sleep to use injection signal instead
Fixes: SCYLLADB-942

Adds an injection signal _from_ table::seal_active_memtable to allow us to
reliably wait for flushing. And does so.

Closes scylladb/scylladb#29070

(cherry picked from commit 0013f22374)

Closes scylladb/scylladb#29116

Closes scylladb/scylladb#29534
2026-04-30 12:06:37 +03:00
Michael Litvak
ef3d4e7e47 test/mv/test_mv_staging: wait for cql after restart
Wait for cql on all hosts after restarting a server in the test.

The problem that was observed is that the test restarts servers[1] and
doesn't wait for the cql to be ready on it. On test teardown it drops
the keyspace, trying to execute it on the host that is not ready, and
fails.

Fixes SCYLLADB-1632

Closes scylladb/scylladb#29562

(cherry picked from commit 3468e8de8b)

Closes scylladb/scylladb#29624

Closes scylladb/scylladb#29642
2026-04-30 08:58:09 +03:00
Patryk Jędrzejczak
10065d13e4 Merge '[Backport 2026.1] raft: Await instead of returning future in wait_for_state_change' from Scylladb[bot]
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.

Fixes SCYLLADB-665

Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.

- (cherry picked from commit c36623baad)

- (cherry picked from commit e4f2b62019)

- (cherry picked from commit fae71f79c2)

Parent PR: #28624

Closes scylladb/scylladb#28753

* https://github.com/scylladb/scylladb:
  test: raft: Add test_aborting_wait_for_state_change
  raft: Describe exception types for wait_for_state_change and wait_for_leader
  raft: Await instead of returning future in wait_for_state_change

---

Changes made for the backport:
* wait_for_leader is a private member function, so we omit extending
  its description.
* The provided test originally used wait_for_leader. Since it's
  private, we replace it with wait_election_done. It should have
  the same effect, and the test should behave in its original form.

Closes scylladb/scylladb#29594
2026-04-30 08:57:01 +03:00
Tomasz Grabiec
9f042ebf45 Merge 'table: don't create new split compaction groups if main compaction group is disabled' from Ferenc Szili
Fixes a race condition where tablet split can crash the server during truncation.

`truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server.

In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction.

A new regression test `test_split_emitted_during_truncate` reproduces the
exact interleaving using two error injection points:

- **`database_truncate_wait`** — pauses truncation after compaction is disabled but before `discard_sstables()` runs.
- **`tablet_split_monitor_wait`** (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`.

The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward.

Fixes: SCYLLADB-1035

This needs to be backported to all currently supported version.

Closes scylladb/scylladb#29250

* github.com:scylladb/scylladb:
  test: add test_split_emitted_during_truncate
  table: fix race between tablet split and truncate

(cherry picked from commit 7fe4ae16f0)

Closes scylladb/scylladb#29572
2026-04-29 18:46:43 +02:00
Marcin Maliszkiewicz
f2e458dcb2 Merge 'storage_service: fix REST API races during shutdown and cross-shard forwarding' from Piotr Smaron
REST route removal unregisters handlers but does not wait for requests
that already entered storage_service.  A request can therefore suspend
inside an async operation, restart proceeds to tear the service down,
and the coroutine later resumes against destroyed members such as
_topology_state_machine, _group0, or _sys_ks — a use-after-destruction
bug that surfaces as UBSAN dynamic-type failures (e.g. the crash seen
from topology_state_load()).

Fix this by holding storage_service::_async_gate from the entry
boundary of every externally-triggered async operation so that stop()
drains them before teardown begins.  The gate is acquired in
run_with_api_lock, run_with_no_api_lock, and in individual REST
handlers that bypass those wrappers (reload_raft_topology_state,
mark_excluded, removenode, schema reload, topology-request
waits/abort, cleanup, ring/schema queries, SSTable dictionary
training/publish, and sampling).

Additionally, fix get_ownership() and abort_topology_request() which
forward work to shard 0 but were still referencing the caller-shard's
`this` pointer instead of the destination-shard instance, causing
silent cross-shard access to shard-local state.
Add a cluster regression test that repeatedly exercises the multi-shard
ownership REST path to cover the forwarding fix.

Fixes: SCYLLADB-1687

Should be backported to all branches, the code has been introduced around 2024.1 release.

Closes scylladb/scylladb#29373

* github.com:scylladb/scylladb:
  storage_service: fix shard-0 forwarding in REST helpers
  storage_service: gate REST-facing async operations during shutdown
  storage_service: prepare for async gate in REST handlers

(cherry picked from commit 4043d95810)

Closes scylladb/scylladb#29611

Closes scylladb/scylladb#29668
2026-04-29 11:46:35 +02:00
Wojciech Mitros
2b20bd887b db/view: track range tombstones in update stream during view update building
The view update builder ignored range tombstone changes from the update
stream when there all existing mutation fragments were already consumed.
The old code assumed range tombstones 'remove nothing pre-existing, so
we can ignore it', but this failed to update _update_current_tombstone.
Consequently, when a range delete and an insert within that range appeared
in the same batch, the range tombstone was not applied to the inserted row,
or was applied to a row outside the range that it covered causing it to
incorrectly survive/be deleted in the materialized view.

Fix by handling is_range_tombstone_change() fragments in the update-only
branch, updating _update_current_tombstone so subsequent clustering rows
correctly have the range tombstone applied to them.

Fixes SCYLLADB-1555

Closes scylladb/scylladb#29483

(cherry picked from commit 6011cb8a4c)

Closes scylladb/scylladb#29569

Closes scylladb/scylladb#29643
2026-04-25 18:45:26 +03:00
Wojciech Mitros
6c51548fd6 test/cluster: fix flaky test_hints_consistency_during_replace
The test creates a sync point immediately after writing 100 rows
with CL=ANY, without waiting for pending hint writes to complete.

store_hint() is fire-and-forget: it submits do_store_hint() to a gate
and returns immediately. do_store_hint() updates _last_written_rp only
after writing to the commitlog. If create_sync_point() is called before
all do_store_hint() coroutines complete, the captured replay position
is stale, and await_sync_point() returns DONE before all hints are
replayed, leaving some rows missing.

Fix by waiting for the size_of_hints_in_progress metric to reach zero
before creating the sync point, ensuring all in-flight hint writes have
completed and _last_written_rp is up to date. This follows the same
pattern already used in test_sync_point.

Fixes: SCYLLADB-1709

Closes scylladb/scylladb#29623

(cherry picked from commit 7634d3f7d4)

Closes scylladb/scylladb#29632

Closes scylladb/scylladb#29641
2026-04-25 18:45:02 +03:00
Dario Mirovic
e114fc8175 test: use DROP KEYSPACE IF EXISTS in new_test_keyspace cleanup
The new_test_keyspace context manager in test/cluster/util.py uses
DROP KEYSPACE without IF EXISTS during cleanup. The Python driver
has a known bug (scylladb/python-driver#317) where connection pool
renewal after concurrent node bootstraps causes double statement
execution. The DROP succeeds server-side, but the response is lost
when the old pool is closed. The driver retries on the new pool, and
gets ConfigurationException message "Cannot drop non existing keyspace".

The CREATE KEYSPACE in create_new_test_keyspace already uses IF NOT
EXISTS as a workaround for the same driver bug. This patch applies
the same approach to fix DROP KEYSPACE.

Fixes SCYLLADB-1538

Closes scylladb/scylladb#29487

(cherry picked from commit 40740104ab)

Closes scylladb/scylladb#29568

Closes scylladb/scylladb#29644
2026-04-25 18:44:39 +03:00
Botond Dénes
dfd2507f0d test/cluster/test_incremental_repair: fix flaky do_tablet_incremental_repair_and_ops
The log grep in get_sst_status searched from the beginning of the log
(no from_mark), so the second-repair assertions were checking cumulative
counts across both repairs rather than counts for the second repair alone.

The expected values (sst_add==2, sst_mark==2) relied on this cumulative
behaviour: 1 from the first repair + 1 from the second = 2. This works
when the second repair encounters exactly one unrepaired sstable, but
fails whenever the second repair sees two.

The second repair can see two unrepaired sstables when the 100 keys
inserted before it (via asyncio.gather) trigger a background auto-flush
before take_storage_snapshot runs. take_storage_snapshot always flushes
the memtable itself, so if an auto-flush already split the batch into two
sstables on disk, the second repair's snapshot contains both and logs
"Added sst" twice, making the cumulative count 3 instead of 2.

Fix: take a log mark per-server before each repair call and pass it to
get_sst_status so each check counts only the entries produced by that
repair. The expected values become 1/0/1 and 1/1/1 respectively,
independent of how many sstables happened to exist beforehand.

get_sst_status gains an optional from_mark parameter (default None)
which preserves existing call sites that intentionally grep from the
start of the log.

Fixes: SCYLLADB-1711

Closes scylladb/scylladb#29484

(cherry picked from commit d280517e27)

Closes scylladb/scylladb#29633

Closes scylladb/scylladb#29640
2026-04-25 18:43:47 +03:00
Botond Dénes
04e33d69da Merge 'service: tasks: return successful status if a table was dropped' from Aleksandra Martyniuk
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.

Treat the tablet operation as successful if a table is dropped.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494

Needs backport to all live releases

Closes scylladb/scylladb#28933

* github.com:scylladb/scylladb:
  test: add test_tablet_repair_wait_with_table_drop
  service: tasks: return successful status if a table was dropped

(cherry picked from commit 1e41db5948)

Closes scylladb/scylladb#29361
2026-04-24 18:06:09 +03:00
Marcin Maliszkiewicz
5468cd49da Merge 'ldap: fix double-free of LDAPMessage in poll_results()' from Andrzej Jackowski
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.

Fixes: SCYLLADB-1344

Backport: 2026.1, 2025.4, 2025.1 - it's a memory corruption, with a one-line fix, so better backport it everywhere.

Closes scylladb/scylladb#29302

* github.com:scylladb/scylladb:
  test: ldap: add regression test for double-free on unregistered message ID
  ldap: fix double-free of LDAPMessage in poll_results()

(cherry picked from commit 895fdb6d29)

Closes scylladb/scylladb#29393

Closes scylladb/scylladb#29454
2026-04-24 18:05:26 +03:00
Botond Dénes
52a3ed4312 Merge 'test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints' from Dawid Mędrek
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.

To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.

We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.

As a bonus, we improve the test in general too. We explicitly express
the preconditions we rely on, as well as bump the log level. If the
test fails in the future, it might be very difficult do debug it
without this additional information.

Fixes SCYLLADB-1133

Backport: The test is present on all supported branches. To avoid
          running into more failures, we should backport these changes
          to them.

Closes scylladb/scylladb#29191

* github.com:scylladb/scylladb:
  test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
  test: cluster: Introduce auxiliary function keyspace_has_tablets
  test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints

(cherry picked from commit ffd58ca1f0)

Closes scylladb/scylladb#29246

Closes scylladb/scylladb#29596
2026-04-24 18:04:37 +03:00
Tomasz Grabiec
816526842e test/alternator: stop concurrent-requests test when workers hit limit
`test_limit_concurrent_requests` could create far more tables than intended
because worker threads looped indefinitely and only the probe path terminated
the test. In practice, workers often hit `RequestLimitExceeded` first, but the
test kept running and creating tables, increasing memory pressure and causing
flakiness due to bad_alloc errors in logs.

Fix by replacing the old probe-driven termination with worker-driven
termination. Workers now run until any worker sees
`RequestLimitExceeded`.

Fixes SCYLLADB-1181

Closes scylladb/scylladb#29270

(cherry picked from commit b355bb70c2)

Closes scylladb/scylladb#29292

Closes scylladb/scylladb#29610
2026-04-24 18:03:10 +03:00
Benny Halevy
73d7f9c1fe compaction_manager: fix use-after-free in postponed_compactions_reevaluation()
drain() signals the postponed_reevaluation condition variable to terminate
the postponed_compactions_reevaluation() coroutine but does not await its
completion. When enable() is called afterwards, it overwrites
_waiting_reevalution with a new coroutine, orphaning the old one. During
shutdown, really_do_stop() only awaits the latest coroutine via
_waiting_reevalution, leaving the orphaned coroutine still alive. After
sharded::stop() destroys the compaction_manager, the orphaned coroutine
resumes and reads freed memory (is_disabled() accesses _state).

Fix by introducing stop_postponed_compactions(), awaiting the reevaluation coroutine in
both drain() and stop() after signaling it, if postponed_compactions_reevaluation() is running.
It uses an std::optional<future<>> for _waiting_reevalution and std::exchange to leave
_waiting_reevalution disengaged when postponed_compactions_reevaluation() is not running.
This prevents a race between drain() and stop().

While at it, fix typo in _waiting_reevalution -> _waiting_reevaluation.

Fixes: SCYLLADB-1600
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29443

(cherry picked from commit 05a00fe140)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29527

Closes scylladb/scylladb#29629
2026-04-24 18:02:31 +03:00
Ernest Zaslavsky
116a2f43ee sstables_loader: prevent use-after-free on table drop during streaming
sstables_loader::load_and_stream holds a replica::table& reference via
the sstable_streamer for the entire streaming operation.  If the table
is dropped concurrently (e.g. DROP TABLE or DROP KEYSPACE), the
reference becomes dangling and the next access crashes with SEGV.

This was observed in a longevity-50gb-12h-master test run where a
keyspace was dropped while load_and_stream was still streaming SSTables
from a previous batch.

Fix by acquiring a stream_in_progress() phaser guard in load_and_stream
before creating the streamer.  table::stop() calls
_pending_streams_phaser.close() which blocks until all outstanding
guards are released, keeping the table alive for the duration of the
streaming operation.

Fixes: SCYLLADB-1639

Closes scylladb/scylladb#29403

(cherry picked from commit e5e6608f20)

Closes scylladb/scylladb#29558

Closes scylladb/scylladb#29600
2026-04-24 10:33:51 +02:00
Emil Maskovsky
5d1a8b91cd encryption: cover system.raft table in system_info_encryption
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.

During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.

Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.

Fixes: CUSTOMER-268

Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.

Closes scylladb/scylladb#29242

(cherry picked from commit 91df3795fc)

Closes scylladb/scylladb#29526

Closes scylladb/scylladb#29582
2026-04-23 10:17:20 +02:00
Botond Dénes
c9ee67c85c Merge 'transport: improve memory accounting for big responses and slow network' from Marcin Maliszkiewicz
After obtaining the CQL response, check if its actual size exceeds the initially acquired memory permit. If so, acquire additional semaphore units and adopt them into the permit, ensuring accurate memory accounting for large responses.

Additionally, move the permit into a .then() continuation so that the semaphore units are kept alive until write_message finishes, preventing premature release of memory permit. This is especially important with slow networks and big responses when buffers can accumulate and deplete a node's memory.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1306
Related https://scylladb.atlassian.net/browse/SCYLLADB-740

Backport: all supported versions

Closes scylladb/scylladb#29288

* github.com:scylladb/scylladb:
  transport: add per-service-level pending response memory metric
  transport: hold memory permit until response write completes
  transport: account for response size exceeding initial memory estimate

(cherry picked from commit 86417d49de)

Closes scylladb/scylladb#29410

Closes scylladb/scylladb#29455
2026-04-21 12:38:40 +02:00
Piotr Dulikowski
04d8663052 Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race

(cherry picked from commit d9a277453e)

Closes scylladb/scylladb#29001

Closes scylladb/scylladb#29195
2026-04-20 12:59:53 +02:00
Jenkins Promoter
72cd145990 Update ScyllaDB version to: 2025.4.8 2026-04-17 01:18:47 +03:00
Botond Dénes
41e2c2d1c4 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus,  finished request does not imply that a node is removed from
the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children

(cherry picked from commit 2e47fd9f56)

Closes scylladb/scylladb#29193
2026-04-16 21:57:08 +03:00
Pavel Emelyanov
2d1fdce790 object_storage_endpoint_param: Make it formattable for real
Currently the formatter converts it to json and then tries to emit into
the output context with the "...{{}}" format string. The intent was to
have the "...{<json text>}" output. However, the double curly brace in
format string means "print a curly brace", so the output of the above
formatting is "...{}", literally.

Fix by keeping a single curly brace. The "<json text>" thing will have
its own surrounding curly braces.

Fixes #27718

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #27687

(reworked version of commit a6618f2, the formatter is in db/config.cc)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27733
2026-04-16 21:54:23 +03:00
Pavel Emelyanov
1e0487bd57 table: Add formatter for group_id argument in tablet merge exception message
Fixes: SCYLLADB-1432

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29143

(cherry picked from commit 78f5bab7cf)

Closes scylladb/scylladb#29412

Closes scylladb/scylladb#29453
2026-04-16 10:57:37 +03:00
Michał Chojnowski
18fc2eff31 test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py
Need to work around https://github.com/scylladb/python-driver/issues/295,
lest a CQL query fail spuriously after the cluster restart.

Fixes: SCYLLADB-1114

Closes scylladb/scylladb#29118

(cherry picked from commit 6b18d95dec)

Closes scylladb/scylladb#29146

Closes scylladb/scylladb#29366
2026-04-16 10:56:59 +03:00
Botond Dénes
340369a4d4 Merge 'Alternator: add per-table batch latency metrics and test coverage' from Amnon Heiman
This series fixes a metrics visibility gap in Alternator and adds regression coverage.

Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.

It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics.

Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.

This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.

Fixes #28721

**We assume the alternator per-table metrics exist, but the batch ones are not updated**

Closes scylladb/scylladb#28732

* github.com:scylladb/scylladb:
  test(alternator): add per-table latency coverage for item and batch ops
  alternator: track per-table latency for batch get/write operations

(cherry picked from commit 035aa90d4b)

Closes scylladb/scylladb#29067

Closes scylladb/scylladb#29365
2026-04-16 10:56:16 +03:00
Pavel Emelyanov
9041f70f34 s3: Don't rearm credential timers when credentials are not refreshed
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.

Fixes #29056

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29057

(cherry picked from commit 961fc9e041)

Closes scylladb/scylladb#29158

Closes scylladb/scylladb#29364
2026-04-16 10:55:25 +03:00
Nikos Dragazis
7ed772866e scylla_swap_setup: Remove Before=swap.target dependency from swap unit
When a Scylla node starts, the scylla-image-setup.service invokes the
`scylla_swap_setup` script to provision swap. This script allocates a
swap file and creates a swap systemd unit to delegate control to
systemd. By default, systemd injects a Before=swap.target dependency
into every swap unit, allowing other services to use swap.target to wait
for swap to be enabled.

On Azure, this doesn't work so well because we store the swap file on
the ephemeral disk [1] which has network dependencies (`_netdev` mount
option, configured by cloud-init [2]). This makes the swap.target
indirectly depend on the network, leading to dependency cycles such as:

swap.target -> mnt-swapfile.swap -> mnt.mount -> network-online.target
-> network.target -> systemd-resolved.service -> tmp.mount -> swap.target

This patch breaks the cycle by removing the swap unit from swap.target
using DefaultDependencies=no. The swap unit will still be activated via
WantedBy=multi-user.target, just not during early boot.

Although this problem is specific to Azure, this patch applies the fix
to all clouds to keep the code simple.

Fixes #26519.
Fixes SCYLLADB-1257

[1] https://github.com/scylladb/scylla-machine-image/pull/426
[2] https://github.com/canonical/cloud-init/pull/1213#issuecomment-1026065501

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28504

(cherry picked from commit 6d50e67bd2)

Closes scylladb/scylladb#29339

Closes scylladb/scylladb#29354
2026-04-16 10:53:52 +03:00
Andrzej Jackowski
7b97fe4a92 reader_concurrency_semaphore: fix leak workaround
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.

However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".

Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.

Fixes: SCYLLADB-1014
Refs: SCYLLADB-163

Closes scylladb/scylladb#28982

(cherry picked from commit 9247dff8c2)

Closes scylladb/scylladb#28988

Closes scylladb/scylladb#29196
2026-04-15 11:53:42 +02:00
Jenkins Promoter
ba3b7360e0 Update pgo profiles - aarch64 2026-04-15 04:44:13 +03:00
Jenkins Promoter
06e9ecab9b Update pgo profiles - x86_64 2026-04-15 03:56:26 +03:00